Laboratory phonology uses speech data to research questions about the abstract categorical structures of phonology. This collection of papers broadly addresses three such questions: what structures underlie the temporal coordination of articulatory gestures; what is the proper role of segments and features in phonological description; and what structures - hierarchical or otherwise - relate morphosyntax to prosody? In order to encourage the interdisciplinary understanding required for progress in this field, each of the three groups of papers is preceded by a tutorial paper (commissioned for this volume) on theories and findings presupposed by some or all of the papers in the group. In addition, most of the papers are followed by commentaries, written by noted researchers in phonetics and phonology, which serve to bring important theoretical and methodological issues into perspective. Most of the material collected here is based on papers presented at the Second Conference on Laboratory Phonology in Edinburgh, 1989. The volume is therefore a sequel to Kingston and Beckman (eds.), Papers in Laboratory Phonology I, also published by Cambridge University Press.
PAPERS IN LABORATORY PHONOLOGY SERIES EDITORS: MARY E. BECKMAN AND JOHN KINGSTON
Papers in Laboratory Phonology II Gesture, Segment, Prosody
Papers in Laboratory Phonology II Gesture, Segment, Prosody EDITED BY GERARD J. DOCHERTY Department of Speech, University of Newcastle-upon-Tyne
AND D. ROBERT LADD Department of Linguistics, University of Edinburgh
The right of the University of Cambridge to print and sell all manner of books was granted by Henry VIII in 1534. The University has printed and published continuously since 1584.
CAMBRIDGE UNIVERSITY PRESS CAMBRIDGE NEW YORK PORT CHESTER MELBOURNE SYDNEY
Published by the Press Syndicate of the University of Cambridge The Pitt Building, Trumpington Street, Cambridge CB2 1RP 40 West 20th Street, New York, NY 10011-4211, USA 10 Stamford Road, Oakleigh, Victoria 3166, Australia © Cambridge University Press 1992 First published 1992 British Library cataloguing in publication data
Gesture, segment, prosody. - (Papers in laboratory phonology; v. 2)
1. Phonology
I. Docherty, Gerard J.  II. Ladd, D. Robert, 1947-  III. Series
414

Library of Congress cataloguing in publication data

Gesture, segment, prosody / edited by Gerard J. Docherty and D. Robert Ladd.
p. cm. - (Papers in laboratory phonology; 2)
Based on papers presented at the Second Conference in Laboratory Phonology, held in Edinburgh, 1989.
Includes bibliographical references and index.
ISBN 0 521 40127 5
1. Grammar, Comparative and general - Phonology - Congresses.
I. Docherty, Gerard J.  II. Ladd, D. Robert, 1947-  III. Conference in Laboratory Phonology (2nd: 1989: Edinburgh, Scotland)  IV. Series.
P217.G47 1992  414-dc20  91-6386 CIP

ISBN 0 521 40127 5
Transferred to digital printing 2004
Contents

List of contributors  page x
Acknowledgments  xiii
Introduction  1

Section A  Gesture

1  An introduction to task dynamics  SARAH HAWKINS  9
2  "Targetless" schwa: an articulatory analysis  CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN  26
   Comments on chapter 2  SARAH HAWKINS  56
   Comments on chapter 2  JOHN KINGSTON  60
   Comments on chapter 2  WILLIAM G. BARRY  65
3  Prosodic structure and tempo in a sonority model of articulatory dynamics  MARY BECKMAN, JAN EDWARDS, AND JANET FLETCHER  68
   Comments on chapter 3  OSAMU FUJIMURA  87
4  Lenition of /h/ and glottal stop  JANET PIERREHUMBERT AND DAVID TALKIN  90
   Comments on chapter 4  OSAMU FUJIMURA  117
   Comments on chapters 3 and 4  LOUIS GOLDSTEIN  120
   Comments on chapters 3 and 4  IRENE VOGEL  124
5  On types of coarticulation  NIGEL HEWLETT AND LINDA SHOCKEY  128
   Comments on chapter 5  WILLIAM G. BARRY AND SARAH HAWKINS  138

Section B  Segment

6  An introduction to feature geometry  MICHAEL BROE  149
7  The segment: primitive or derived?  JOHN J. OHALA  166
   Comments on chapter 7  G. N. CLEMENTS  183
8  Modeling assimilation in nonsegmental, rule-free synthesis  JOHN LOCAL  190
   Comments on chapter 8  KLAUS KOHLER  224
   Comments on chapter 8  MARIO ROSSI  227
9  Lexical processing and phonological representation  ADITI LAHIRI AND WILLIAM MARSLEN-WILSON  229
   Comments on chapter 9  JOHN J. OHALA  255
   Comments on chapter 9  CATHERINE P. BROWMAN  257
10 The descriptive role of segments: evidence from assimilation  FRANCIS NOLAN  261
   Comments on chapter 10  BRUCE HAYES  280
   Comments on chapter 10  JOHN J. OHALA  286
   Comments on chapter 10  CATHERINE P. BROWMAN  287
11 Psychology and the segment  ANNE CUTLER  290
12 Trading relations in the perception of stops and their implications for a phonological theory  LIESELOTTE SCHIEFER  296
   Comments on chapter 12  ELISABETH SELKIRK  313

Section C  Prosody

13 An introduction to intonational phonology  D. ROBERT LADD  321
14 Downstep in Dutch: implications for a model  ROB VAN DEN BERG, CARLOS GUSSENHOVEN, AND TONI RIETVELD  335
   Comments on chapter 14  NINA GRØNNUM  359
15 Modeling syntactic effects on downstep in Japanese  HARUO KUBOZONO  368
   Comments on chapters 14 and 15  MARY BECKMAN AND JANET PIERREHUMBERT  387
16 Secondary stress: evidence from Modern Greek  AMALIA ARVANITI  398

References  424
Name index  452
Subject index  457
Contributors

AMALIA ARVANITI  Department of Linguistics, University of Cambridge
WILLIAM G. BARRY  Department of Phonetics and Linguistics, University College, London
MARY BECKMAN  Department of Linguistics, Ohio State University
ROB VAN DEN BERG  Instituut voor Fonetiek, Katholieke Universiteit, Nijmegen
MICHAEL BROE  Department of Linguistics, University of Edinburgh
CATHERINE P. BROWMAN  Haskins Laboratories
G. N. CLEMENTS  Department of Modern Languages and Linguistics, Cornell University
ANNE CUTLER  MRC Applied Psychology Unit
JAN EDWARDS  Hunter College of Health Sciences
JANET FLETCHER  Speech, Hearing, and Language Research Centre, Macquarie University
OSAMU FUJIMURA  Department of Speech and Hearing Science, Ohio State University
LOUIS GOLDSTEIN  Department of Linguistics, Yale University
NINA GRØNNUM (formerly Thorsen)  Institut for Fonetik, Copenhagen
CARLOS GUSSENHOVEN  Instituut Engels-Amerikaans, Katholieke Universiteit, Nijmegen
SARAH HAWKINS  Department of Linguistics, University of Cambridge
BRUCE HAYES  Department of Linguistics, University of California, Los Angeles
NIGEL HEWLETT  School of Speech Therapy, Queen Margaret College
JOHN KINGSTON  Department of Linguistics, University of Massachusetts
KLAUS KOHLER  Institut für Phonetik, Christian-Albrechts-Universität, Kiel
HARUO KUBOZONO  Department of British and American Studies, Nanzan University
D. ROBERT LADD  Department of Linguistics, University of Edinburgh
ADITI LAHIRI  Max Planck Institut für Psycholinguistik, Nijmegen
JOHN LOCAL  Department of Language, University of York
WILLIAM MARSLEN-WILSON  Department of Psychology, Birkbeck College
FRANCIS NOLAN  Department of Linguistics, University of Cambridge
JOHN J. OHALA  Department of Linguistics, University of Alberta and Department of Linguistics, University of California, Berkeley
JANET PIERREHUMBERT  Department of Linguistics, Northwestern University
TONI RIETVELD  Instituut voor Fonetiek, Katholieke Universiteit, Nijmegen
MARIO ROSSI  Institut de Phonétique, Université de Provence
LIESELOTTE SCHIEFER  Institut für Phonetik und Sprachliche Kommunikation, Universität München
ELISABETH SELKIRK  Department of Linguistics, University of Massachusetts
LINDA SHOCKEY  Department of Linguistic Science, University of Reading
DAVID TALKIN  AT&T Bell Labs
IRENE VOGEL  Department of Linguistics, University of Delaware
Acknowledgments
The Second Conference on Laboratory Phonology, on which this book is based, was made possible by the financial and organizational support of a number of people and institutions. We received outside financial support from IBM (UK), British Telecom, and the Scottish Development Agency, and - within the university - contributions both financial and material from the Centre for Cognitive Science and the Centre for Speech Technology Research. The advice and assistance of the university's very professional Conference Centre was invaluable. We were also fortunate to have a number of enthusiastic student assistants at the conference, who helped us and the participants with everything from tea breaks and photocopies to taxis and overseas phone calls: they were Hazel Sydeserff (prima inter pares), Tina Barr, Keith Edwards, Edward Flemming, Yuko Kondo, Michael Johnston, Mark Schmidt, and Eva Schulze-Berndt. We also thank Ethel Jack, the Linguistics Department secretary and one of the unsung heroes of British linguistics, for keeping track of our finances and generally ensuring that matters stayed under control. The task of making the collection of papers and commentaries from two dozen different authors into a presentable manuscript was made immeasurably easier by the patient assistance of Keith Edwards. Among other things, he prepared the unified list of references and cross-checked the references in the individual contributions; he also helped in creating the index. If we had not had such a competent and dedicated editorial assistant it would certainly have taken far longer for this volume to see the light of day. We are also grateful for the advice and assistance we have received from Marion Smith and Jenny Potts at Cambridge University Press, and for a grant from the Centre for Speech Technology Research which helped defray the costs of publication. For their services as referees we are grateful to Tom Baer, Nick Clements, Jonathan Dalby, Bill Hardcastle, John Harris, Pat Keating,
Ailbhe Ní Chasaide, Stefanie Shattuck-Hufnagel, Kim Silverman, and Ken Stevens. We would also like to thank an anonymous Cambridge University Press reviewer for very helpful comments on an earlier version of the manuscript as a whole. As with any conference, our greatest debt is to the participants, who held forth and listened and discussed and ultimately made the conference what it was. We are pleased to have been part of the development of what appears to be a successful series of conferences, and, more importantly, what appears to be a productive approach to learning about the sound patterns of language.

Gerard J. Docherty
D. Robert Ladd
Introduction
The Second Conference on Laboratory Phonology was organized by the Department of Linguistics at the University of Edinburgh and took place in Edinburgh from 30 June to 3 July 1989. The conference's primary aim was to further the general intellectual agenda set by the ambitiously named First Conference, which brought together researchers in phonological theory and experimental phonetics to discuss the increasing convergence of their interests. An important secondary aim was to bring together researchers from both sides of the Atlantic: whereas the first conference was almost exclusively American, the second had significant delegations from several European countries and Japan as well as the USA. This book is the record of the second conference. We say "record" rather than "proceedings" because the papers collected here are neither all nor only those presented in Edinburgh. As in the first conference, the main papers were circulated in advance and invited discussants gave prepared comments at the conference; this format is reflected in the organization of this volume as a series of chapters with commentaries. However, all of the main papers were formally refereed after the conference, and have been revised, in the light of both referees' comments and discussion at the conference itself, for publication in their present form. Moreover, for a variety of reasons, quite a number of contributions to the conference (listed at the end of this introduction) do not appear in the volume, and as a result, the volume's organization is rather different from that of the conference program. We have grouped the chapters into three main sections, which we have called Gesture (on temporal coordination of articulatory gestures), Segment (on the nature and classification of segments), and Prosody (on certain aspects of the prosodic organization of speech). At the beginning of each section we have included a tutorial chapter, presenting a synopsis of recent
theoretical developments that are presupposed in some or all of the papers in the section, which we hope will make the volume accessible to a wider range of readers. This reorganization means that certain contributions appear in a context rather different from that in which they were presented at the conference: for example, the chapter by Cutler was originally prepared as a general commentary on a series of papers dealing with "The Segment," and the commentary by Vogel on Pierrehumbert and Talkin's paper was originally prepared as a general commentary on a series of papers dealing with prosody and prosodic effects on segmental realization. We hope that in imposing the organization we have chosen we have nevertheless succeeded in preserving the sense of productive debate that was present at the conference. Cutting across the organization into three subject areas we see three major issues: the methodology and design of laboratory research on phonology; the psychological reality of lexical and phonological representations; and the nature of the phonology-phonetics "interface." On the question of methodology and design, the papers collected here seem to show a growing consensus. Fujimura finds Pierrehumbert and Talkin's paper an exemplary case of how laboratory phonology should be carried out, but the rigor exhibited in their experimental descriptions is seen in several of the papers, such as Lahiri and Marslen-Wilson's, Beckman, Edwards, and Fletcher's, Browman and Goldstein's, and Nolan's. We feel that papers such as these set the baseline for future work in laboratory phonology. Fujimura, Vogel, and Cutler all note that if phonology is to be successfully tested in the laboratory, scientific rigor is essential. This applies to the use of terminology as well as to laboratory practice; some specific problems in this area are raised in the contributions by Vogel and by Hewlett and Shockey. The second theme - the question of psychological reality of lexical representations - is explicitly addressed only in the papers by Lahiri and Marslen-Wilson and by Cutler, but we feel that it is an underlying issue throughout the volume. Lahiri and Marslen-Wilson, in an approach that departs from much recent work in laboratory phonology, use listeners' ability to recognize sounds in a gating-paradigm experiment to argue for abstractness in the lexical representation. Cutler reviews the considerable body of evidence for the psycholinguistic relevance of the segment. Both papers raise a fundamental problem: to what extent is it possible to relate models of phonological and phonetic representation, which are at the center of the debate in the rest of this volume, to what speakers actually do when they are producing an utterance? Lahiri and Marslen-Wilson suggest that psycholinguistic evidence can be used successfully to empirically evaluate the claims of theoretical phonology, whereas Cutler suggests that phonology and psycholinguistics may be fundamentally different exercises, and points to the "orthogonality of existing psycholinguistic research to phonological issues."
Some resolution of this question is obviously crucial for any approach to phonology that looks to the laboratory and to detailed observations of speech production for evidence and evaluation. The central issue in laboratory phonology, however, is and remains the long-standing problem of the relation between phonetics and phonology. Where does one end and the other begin - or, to be more specific, can phonetics be attributed entirely to neuromotor aspects of the vocal mechanism, or is there a case for inclusion of a phonetic implementation component in the grammar? The papers on gestural coordination and "task dynamics" in the first section of the volume directly address the general issue here. These papers ask whether a gesture-based phonology couched in terms of task dynamics (the theory of skilled movement developed at Haskins Laboratories and presented here in a tutorial chapter by Hawkins) provides a basis for understanding the phonology-phonetics interface. Browman and Goldstein, and to some extent also Beckman, Edwards, and Fletcher, argue that this is a powerful approach, but notes of caution are sounded by both Fujimura and Kingston in their commentaries on these papers. Kingston, for example, while applauding the descriptive elegance of Browman and Goldstein's gestural phonology, raises serious doubts about its explanatory adequacy; he claims that the model is too powerful and lacks the constraints that would permit it to predict only those things that are found to occur in speech.

More specific aspects of the phonology-phonetics interface question are also raised by the papers in the first section of the volume. Hawkins, in her commentary on Browman and Goldstein, comments on the need to be specific about which aspects of the realization of an utterance can be attributed to the gestural score, and which to the model of motor control. For example, should the gestural score contain all the language-particular phonetic characteristics of an utterance? If so, this would involve incorporating a large amount of (redundant) phonetic detail into the gestural score. If not, it has yet to be demonstrated how such aspects of phonetic realization could arise from a task-dynamics model of motor control. A related issue is the degree of specification associated with the spatial and temporal targets in the gestural score. Browman and Goldstein propose that certain aspects of the gestural score can be left unspecified and the detailed instantiation of any particular target determined by the task-dynamics model. In a somewhat different approach to a comparably complex problem of phonetic variability, the papers by Beckman, Edwards, and Fletcher and by Pierrehumbert and Talkin suggest that a notion of "prosodic modulation" - effects of phrase-level and word-level prosodic structure on the laryngeal component in the production of consonants - may make it unnecessary to posit a large number of separate rules for superficially independent types of phonetic variability.
In the second section of the volume the question of the phonology-phonetics "interface" is attacked on two further fronts. First, papers by Local and Nolan (with associated commentaries by Hayes, Browman, Kohler, and Ohala) deal with assimilation: to what extent is assimilation the result of a phonological "rule" rather than a phenomenon emerging from the organization and coordination of articulator variables in the execution of an utterance? The key data discussed in these papers are instrumental measurements showing that assimilations are commonly no more than partial; the theoretical models which are brought to bear on such findings by Local, Hayes, and Kohler are extremely varied, and themselves make a number of predictions which are open to future empirical investigation. The second question addressed here, in the papers by Ohala and Local (and the comments by Clements, Kohler, and Rossi), is whether it is justified to posit a role for the segment in a phonological representation. This is not a new issue, but as Broe points out in his tutorial chapter, the power of nonsegmental representations in accounting for the sound pattern of languages has been recognized by more and more investigators over the last couple of decades.

Finally, in the third section of the volume, we see instrumental phonetic evidence treated as central in the search for appropriate phonological descriptions - for example, in Arvaniti's paper on the phenomena to be accounted for in any phonological description of Greek stress. We feel that the central role accorded to instrumental data in these papers has somewhat equivocal implications for laboratory phonology in general. It could be seen as pointing the way to a new conception of the phonology-phonetics interface, or it could simply show that segmental and prosodic phenomena really are different. That is, phonological descriptions of prosodic phenomena such as intonation have tended (in the absence of a pretheoretical descriptive construct like the orthographically based segment) to differ from one another in ways that cannot be resolved by reference to common conceptions of phonology and phonetics. Only with the application of instrumental evidence to questions of phonological organization rather than speech production have we begun to approach some consensus; as Beckman and Pierrehumbert note in their commentary, the modeling of high tones (fundamental-frequency peaks) is "one of the success stories of laboratory phonology." It remains to be seen how widely this success can be extended into the realm of gestures and segments.
Conference contributions not included in this volume

Papers
Lou Boves, "Why phonology and speech technology are different"
Jean-Marie Hombert, "Nasal consonants and the development of vowel nasalization"
Brian Pickering and John Kelly, "Tracking long-term resonance effects"
Elisabeth Selkirk and Koichi Tateishi, "Syntax, phrasing, and prominence in the intonation of Japanese"

Commentaries
Stephen R. Anderson, Gösta Bruce, Stephen D. Isard, Björn Lindblom, Joan Mascaró

Posters
Anne Cutler, Janet Fletcher, Sarah Hawkins, Jill House, Daniel Recasens, Jacques Terken, Ian Watson
Section A Gesture
1  An introduction to task dynamics

SARAH HAWKINS
1.1 Motivation and overview
The aim of this paper is to describe for the nonspecialist the main features of task dynamics so that research that uses it can be understood and evaluated more easily.* Necessarily, there are some omissions and simplifications. More complete accounts can be found in the references cited in the text, especially Saltzman (1986) and Saltzman and Munhall (1989); Browman and Goldstein (1989, 1990) offer clear descriptions that focus more on the phonologically relevant aspects than the mathematical details. The task-dynamic model is being developed at the same time as it is being used as a research tool. Consistent with this paper's purpose as a general introduction rather than a detailed critique, it mainly describes the current model, and tends not to discuss intentions for how the model should ultimately work or theoretical differences among investigators.

* I thank Thomas Baer, Catherine Browman, and Elliot Saltzman for helpful comments on earlier versions of this paper.

Task dynamics is a general model of skilled movement control that was developed originally to explain nonspeech tasks such as reaching and standing upright, and has more recently been applied to speech. It is based on general biological and physical principles of coordinated movement, but is couched in dynamical rather than anatomical or physiological terms. It involves a relatively radical approach that is more abstract than many more traditional systems, and has proved to be a particularly useful way of analyzing speech production, partly because it breaks complex movements down into a set of functionally independent tasks.

Task dynamics describes movement in terms of the tasks to be done, and the dynamics involved in doing them. A single skilled movement may involve
several discrete, abstract tasks in this model. Speech requires a succession of skilled movements, each of which is modeled as a number of tasks. For example, to produce English [ʃ], the precise action of the tongue is critical, the lips are somewhat rounded and protruded, and the vocal folds are moved well apart to ensure voicelessness and a strong flow of air. In addition to controlling respiratory activity, therefore, there may be at least five distinct tasks involved in saying an isolated [ʃ]: keeping the velopharyngeal port closed, and controlling the degree of tongue constriction and its location, the degree of lip protrusion, the size of the lip aperture, and the size of the glottal aperture. The same articulator may be involved in more than one task. In this example, the jaw contributes to producing both the correct lip aperture and the correct tongue constriction. To add to the complexity, when the [ʃ] is spoken as part of a normal utterance, each of its tasks must be carried out while the same articulators are finishing or starting tasks required for nearby sounds.

The sort of complicated tasks we habitually do with little conscious effort - reaching for an object, lifting a cup to the mouth without spilling its contents, speaking - are difficult to explain using physiological models in which the angles of the joints involved in the movement are controlled directly. These models work well enough when there is only one joint involved (a "single-degree-of-freedom" task), because such movements are arc-shaped. But even simple tasks usually involve more than one joint. To bring a cup to the mouth, for example, the shoulder, elbow and wrist joints are used, and the resultant movement trajectory is not an arc, but a more-or-less straight line. Although describing how a straight-line trajectory could be controlled sounds like a simple problem, it turns out to be quite complicated. Task dynamics offers one way of modeling quasi-straight lines (Saltzman and Kelso 1987).

A further complication is the fact that in these skilled "multi-degree-of-freedom" tasks (involving more than one joint) the movement trajectory usually has similar characteristics no matter where it is located in space. When reaching outwards for an object, the hand tends to move in a straight line, regardless of where the target is located with respect to the body before the movement begins, and hence regardless of whether the arm starts by reaching away from the body, across the body, or straight ahead. This ability to do the same task using quite different joint angles and muscle contractions is a fundamental characteristic of skilled movement, and has been called "motor equivalence" (Lashley 1930; Hebb 1949).

One important property related to motor equivalence is immediate compensation, whereby if a particular movement is blocked, the muscles adjust so that the movement trajectory continues without major disruption to attain the final goal. Immediate compensation has been demonstrated in
"perturbation" experiments, in which a moving articulator - usually an elbow, hand, lip, or jaw - is briefly tugged in an unpredictable way so that the movement is disrupted (e.g. Folkins and Abbs 1975; Kelso et al. 1984). The articulators involved in the movement immediately compensate for the tug and tend to return the movement to its original trajectory by adjusting the behavior of the untugged as well as the tugged articulators. Thus, if the jaw is tugged downwards during an upward movement to make a [b] closure, the lips will compensate by moving more than usual, so that the closure occurs at about the same time in both tug and no-tug conditions. These adjustments take place more or less immediately (15-30 msec after the perturbation begins), suggesting an automatic type of reorganization rather than one that is under voluntary, attentional control. Since speaking is a voluntary action, however, we have a reflexive type of behavior within a clearly nonreflexive organization. These reflexive yet flexible types of behavior are called "functional reflexes."

Task dynamics addresses itself directly to explaining both the observed quasi-straight-line movement trajectories and immediate compensation of skilled limb movements. Both properties can be reasonably simply explained using a model that shapes the trajectories implicitly as a function of the underlying dynamics, using as input only the end goal and a few parameters such as "stiffness," which are discussed below. The model automatically takes into account the conditions at the start of the movement. The resulting model is elegant and uses simple constructs like masses and springs. However, it is hard to understand at first because it uses these constructs in highly abstract ways, and requires sophisticated mathematics to translate the general dynamical principles into movements of individual parts of the body. Descriptions typically involve a number of technical terms which can be difficult to understand because the concepts they denote are not those that are most tangible when we think about movement. These terms may seem to be mere jargon at first sight, but they have in fact been carefully chosen to reflect the concepts they describe. In this paper, I try to explain each term when I first use it, and I tend to use it undefined afterwards. New terms are printed in bold face when they are explained.

1.2 Basic constructs
The essence of task dynamics, making it distinct from other systems, is implicit in its name. It describes movement in terms of the tasks to be done, using dynamics that are specific to the task but not to the parts of the body that are doing the task. A task is generally a gesture involving the control of a single object, or abstract representation of the actual thing being controlled. For example, the object represents in abstract terms the position of a hand 11
relative to a target in a reaching task, or an articulatory constriction in a speech task. Such "objects" are called task variables; they describe the type of task required. In order to realize the movement, the abstract disembodied task must be converted into a set of parameters appropriate for the part of the body that will perform the task, and finally into movements of the actual articulators. Currently, only one type of task (described below) is modeled for speech, so most recent discussions make no effective distinction between task variables, which are associated with the disembodied task, and the variables associated with specific body parts, which are called tract variables, as in vocal-tract variables (sometimes, local tract variables). In current formulations, tract variables are the dimensions allowing particular vocaltract constrictions to be specified. In a reaching task, it is the position of the target that is being reached for that is most important, so the task is defined on a system of coordinates with the task target at the common origin. The position of the abstract hand with respect to the target is the variable that is being controlled. In speech, similarly, the task is defined in terms of the location (place) and crosssectional area (degree) of an ideal constriction. This nonspecific task is then transformed appropriately for a specific vocal-tract constriction. For example, the degree of openness suitable for some lip configuration is regarded in the task-dynamic model as a requirement to achieve a particular lip aperture, not as a requirement for lips and jaw to be in a certain position. The lips, in this case, are the terminal devices, or end effectors, since their position directly defines the lip aperture; the upper and lower lips and the jaw together form the effector system - the set of organs that includes the terminal device and gets it to the right place at the right time. Lip aperture itself is a tract variable. Thus, whereas earlier models use coordinates of either body space or articulator space to specify the mathematics of movement, and so describe movement in terms of the position of a physical object with respect to the body or in terms of the angles of joints respectively, task dynamics begins by defining movement in terms of an abstract task space, using spatial coordinates and equations of motion that are natural for the task, rather than for the particular parts of the body that are performing the task. Coordinating the different parts of the body to produce the movement is done entirely within the model. In summary, to carry out the transformations between task, body, and articulator spaces, the task-dynamic model translates each task-specific equation into equivalent equations that apply to motion of particular parts of the body. The first transformation is from the abstract task space to a more specific (but still relatively abstract) body space. The task-space equation is defined solely in terms of the type of task: the place and degree of 12
an unspecified constriction. The transformation into body space specifies the actual task: the tract variable involved, which for speech is a specific constriction such as lip aperture or the location of the constriction formed by the tongue dorsum. The second transformation is from body space to articulator space, or from the tract variables to the articulators involved. This transformation is complicated since in speech there are usually several articulators comprising the effector system of a single tract variable. There are hence many degrees of freedom, or many ways for the component articulators to achieve the same movement trajectory. Detailed accounts of all these transformations are given in Kelso, Saltzman, and Tuller (1986a, 1986b), Saltzman (1986), and Saltzman and Munhall (1989). This way of structuring the model allows immediate compensation to be accounted for elegantly. Control is specified at the tract-variable level. From this level there is an automatic mapping to the equations that govern the motions of individual articulators. Since the mapping is constantly updated, small perturbations of an articulator will be immediately and automatically adjusted for in response to the demands of the tract-variable goal governing the effector system; no explicit commands for adjustment are necessary, so compensation will take place with the very short time lags that are typically observed. Speech movements are characterized using abstract, discrete gestures. A gesture is defined as one of a family of movement patterns that are functionally equivalent ways of achieving the same goal, such as bilabial closure. A major distinction between gesture and movement is that gestures can only be defined with reference to the goal or task, whereas movements need not be linked to specific tasks. One or more tract variables contribute to a gesture. So, for speech, gestures are defined in terms of vocal-tract constrictions. Velic and glottal gestures are modeled as deriving directly from, respectively, the velic and glottal aperture tract variables. In each case, a single dynamical equation specifies the gesture. Lingual gestures depend on two tract variables, so each one is specified by two dynamical equations, one for the place and one for the degree of constriction. To make a gesture, the activities of all the articulators that contribute to the relevant tract variables are coordinated. That is, the components of the effector systems work as functional units, specified by the dynamical equations. The zero lip aperture needed to produce a [b] closure is achieved by some combination of raising the jaw and lower lip and lowering the upper lip. In a sequence like [aba], the jaw will be in a low (nonraised) position to achieve the open aperture for the two [a]s, whereas it will be much higher for the close vowels of [ibi]. Consequently, the lips are likely to show different patterns of activity to achieve the aperture for [b] in the two sequences. The task is the same for [b] in both [aba] and [ibi], and the tract variables (and 13
hence the effector systems and terminal devices) are the same. But the physical details of how the task is achieved differ (see e.g. Sussman, MacNeilage, and Hanson 1973).

The observed movements are governed by the dynamics underlying a series of discrete, abstract gestures that overlap in time and space. At its present stage of development, the task-dynamic model does not specify the sequencing or relative timing of the gestures, although work is in progress to incorporate such intergestural coordination into the model (Saltzman and Munhall 1989). The periods of activation of the tract variables are thus at present controlled by a gestural score which is either written by the experimenter or generated by rule (Browman et al. 1986). In addition to giving information about the timing of gestures, the gestural score also specifies a set of dynamic parameters (such as the spatial target) that govern the behavior of each tract variable in a gesture. Thus the gestural score contains information on the relative timing and dynamic parameters associated with the gestures in a given utterance; this information is the input to the task-dynamic model proper, which determines the behavior of individual articulators. In this volume, Browman and Goldstein assign to the gestural score all the language-particular phonetic/phonological structure necessary to produce the utterance.

The system derives part of its success from its assumption that continuous movement trajectories can be analyzed in terms of discrete gestures. This assumption is consistent with the characterization of speech as successions of concurrently occurring tasks described above. Two important consequences of using gestures are, first, that the basic units of speech are modeled as movements towards targets rather than as static target positions, and second, that we can work with abstract, essentially invariant units in a way that produces surface variability. While the gestures may be invariant, the movements associated with them can be affected by other gestures, and thus vary with context. Coarticulation is thus modeled as inherent within the speech-production process.
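Before turning to the details of the model, a deliberately simplified sketch may help fix the idea of a tract variable realized by an effector system, as in the [aba]/[ibi] example above: a required lip-aperture change is shared out over hypothetical articulators according to invented weightings (the task-dynamic model itself does this through the matrix transformations described in the next section). The positions, weightings, and splitting rule below are illustrative assumptions only, not values or mechanisms from the model.

```python
# Toy illustration of one tract variable (lip aperture) realized by an
# effector system of three articulators. Positions and weightings are
# invented; the real model distributes motion via its own equations.

def lip_aperture(upper_lip, lower_lip, jaw):
    # Aperture = distance between the upper lip and the jaw-carried lower lip.
    return upper_lip - (jaw + lower_lip)

def close_lips(upper_lip, lower_lip, jaw, target_aperture, weights):
    """Distribute the required aperture change over the articulators in
    proportion to hypothetical weightings (upper lip moves down, the
    lower lip and jaw move up)."""
    change = lip_aperture(upper_lip, lower_lip, jaw) - target_aperture
    w_ul, w_ll, w_jaw = weights
    total = w_ul + w_ll + w_jaw
    upper_lip -= change * w_ul / total
    lower_lip += change * w_ll / total
    jaw += change * w_jaw / total
    return upper_lip, lower_lip, jaw

weights = (0.2, 0.5, 0.3)   # invented articulator weightings

# [aba]: jaw low for the open vowel; [ibi]: jaw already high for [i].
print(close_lips(upper_lip=12.0, lower_lip=2.0, jaw=0.0,
                 target_aperture=0.0, weights=weights))
print(close_lips(upper_lip=12.0, lower_lip=2.0, jaw=6.0,
                 target_aperture=0.0, weights=weights))
# Same task (zero aperture), same weightings, but different starting jaw
# heights yield different articulator configurations: motor equivalence
# viewed at the tract-variable level.
```

Both calls reach the same zero aperture, but from the high-jaw starting position (as for [ibi]) the lips and jaw have less work to do; the invariant element is the tract-variable goal, not the individual articulator movements.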
1.3 Some details of how the model works

The original ideas for task dynamics grew out of work on movement control in many laboratories, especially in the Soviet Union and the USA (see Bernstein 1967; Greene 1971), and were later developed in the context of action theory (e.g. Fowler et al. 1980; Kelso et al. 1980; Kelso and Tuller 1984). More recently, Saltzman (1986) has contributed the equations that allow the model to be implemented for speech. (For applications to limb movement, see Saltzman and Kelso [1987].) The equations involve mathematics whose details few nonspecialists are likely to understand - I certainly do not. The important thing from the point of view of understanding the significance of task dynamics is that the equations determine the possible movement trajectories for each tract variable, or functional grouping of articulators, and also how these trajectories will combine for each gesture (grouping of tract variables) and for each context of overlapping gestures. In general, motion associated with a single tract variable is governed by means of a differential equation, and matrix transformations determine how the movements for component articulators contribute to this motion.

The type of differential equation that is used in modeling movement depends on the characteristics of that movement. For speech, the behavior of a tract variable is viewed by task dynamics as a movement towards a spatial target. This type of task is called a point-attractor task. It is modeled, for each tract variable, with a differential equation that describes the behavior of a mass connected to a spring and a damper. Movements of the mass represent changes in the value of the tract variable (e.g. changes in lip aperture). The mass is entirely abstract and is considered to be of constant size, with an arbitrary value of one. The spring can be thought of as pulling the tract variable towards its target value. The resting or equilibrium position of the spring represents the target for the tract variable and the mass moves because the spring tries to reach equilibrium. It is as if one end of the spring is attached to the mass, and the other end is moved around in space by the gestural score to successive target locations. When one end is held at a target location, the mass will begin to move towards that location, because the spring will begin to move towards its equilibrium position and drag the mass with it.

When an undamped mass-spring system is set into motion, the result is a sinusoidal oscillation. Damping introduces decaying oscillations or even nonoscillatory movement, and can have an enormous effect on movement patterns. The type of damping introduced into the mass-spring system affects the duration and trajectory of a movement. Saltzman and Munhall (1989: 346, 378) imply that the damping coefficient for a tract variable (as well as all other parameters except mass) will vary with the phonetic category of the sound being produced, at least when estimated from actual speech data. While the nature of damping is being actively investigated, in practice only one form, critical damping, is used for nonlaryngeal gestures at present. (Laryngeal gestures are undamped.) In a critically damped mass-spring system, the mass does not oscillate sinusoidally, but only asymptotes towards the equilibrium (or target) position. In other words, the mass moves increasingly slowly towards the target and never quite reaches it; there is no physical oscillation around the target, and, except under certain conditions,
no "target overshoot." One consequence of using critically damped trajectories is that the controlled component of each gesture is realized only as a movement towards the target. The assumption of constant mass and the tendency not to vary the degree of damping mean that we only need consider how changes in the state of the spring affect the pattern of movement of a tract variable. The rate at which the mass moves towards the target is determined by how much the spring is stretched, and by how stiff the spring is. The amount of stretch, called displacement, is the difference between the current location of the mass and the new target location. The greater the displacement, the greater the peak velocity (maximum speed) of movement towards equilibrium, since, under otherwise constant conditions, peak velocity is proportional to displacement. A stiff spring will move back to its resting position faster than a less stiff spring. Thus, changes in the spring's stiffness affect not only the duration of a movement, but also the ratio of its peak velocity to peak displacement. This ratio is often used nowadays in work on movement. Displacement in the model can be quite reasonably related to absolute physical displacement, for example to the degree of tongue body/jaw displacement in a vocalic opening gesture. Stiffness, in contrast, represents a control strategy relevant to the behavior of groups of muscles, and is not directly equatable to the physiological stiffness of individual muscles (although Browman and Goldstein [1989] raise the possibility that there may be a relationship between the value of the stiffness parameter and biomechanical stiffness). In phonetic terms, changes in stiffness affect the duration of articulatory movement. Less stiffness results in slower movement towards the target. The phonetic result depends partly on whether there are also changes in the durations and relative timing of the gestures. For example, if there were no change in gestural activation time when stiffness was reduced, then the slower movement could result in gestures effectively undershooting their targets. In general, stiffness can be changed within an utterance to affect, for example, the degree of stress of a syllable, and it can be changed to affect the overall rate of speech. Changes in displacement (stretch) with no change in stiffness, on the other hand, affect the maximum speed of movement towards the target, but not the overall duration of the movement. To generate movement trajectories, the task-dynamic model therefore needs to know, for each relevant task variable, the current state of the system (the current position and velocity of the mass), the new target position, and values for the two parameters representing, respectively, the stiffness of the hypothetical spring associated with the task variable and the type of friction (damping) in the system. The relationships between these various parameters 16
1 Sarah Hawkins
are described by the following differential equation for a damped massspring system. mx + bx + k(x - xo) = 0 where:
m= b= k= xo = x=
mass associated with the task variable damping of the system stiffness of the spring equilibrium position of the spring (the target) instantaneous value of the task variable (current location of the mass) x = instantaneous velocity of the task variable x = instantaneous acceleration of the task variable (x - xo) = instantaneous displacement of the task variable
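To make the behavior of this equation more concrete, the short sketch below integrates it numerically for a single tract variable under the settings discussed in the surrounding paragraphs (unit mass, critical damping). The function name, the stiffness and target values, and the "lip aperture" framing are invented for illustration; they are not taken from the task-dynamic literature.

```python
# Minimal numerical sketch of one critically damped point-attractor:
# a tract variable x is pulled toward a target x0. All parameter values
# here are invented for illustration.

def point_attractor(x_init, x0, k, dt=0.001, t_max=0.5, m=1.0):
    """Integrate m*x'' + b*x' + k*(x - x0) = 0 with critical damping
    (damping ratio b / (2*sqrt(m*k)) = 1), starting from rest."""
    b = 2.0 * (m * k) ** 0.5          # critical damping factor
    x, v, t = x_init, 0.0, 0.0        # value, velocity, time
    trajectory = []
    while t < t_max:
        a = -(b * v + k * (x - x0)) / m   # acceleration from the equation
        v += a * dt
        x += v * dt
        trajectory.append((t, x, v))
        t += dt
    return trajectory

# A hypothetical "lip aperture" gesture: from 10 mm toward a 0 mm target.
slow = point_attractor(x_init=10.0, x0=0.0, k=100.0)   # less stiff spring
fast = point_attractor(x_init=10.0, x0=0.0, k=400.0)   # stiffer spring

peak_speed = lambda traj: max(abs(v) for _, _, v in traj)
print(peak_speed(fast) / peak_speed(slow))  # stiffer spring -> faster movement
print(min(x for _, x, _ in fast))           # asymptotes toward 0 without overshooting
```

The two printed values illustrate the properties discussed above and below: stiffness governs how quickly the movement unfolds, while the critically damped trajectory approaches its target asymptotically rather than oscillating around it.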
Since, in the model, the mass m has a value of 1.0, and the damping ratio, b/(2√(mk)), is also usually set to 1.0, then once the stiffness k and target x₀ are specified, the equation can be solved for the motion over time of the task variable x. (Velocity and acceleration are the first and second time derivatives, respectively, of x, and so can be calculated when the time function for x is known.) Solving for x at each successive point in time determines the trajectory and rate of movement of the tract variable. Any transient perturbation of the ongoing movement, as long as it is not too great, is immediately and automatically compensated for because the point-attractor equation specifies the movement characteristics of the tract variable rather than of individual articulators.

Matrix transformations in the task-dynamic system determine how much each component articulator contributes to the movement of the tract variable. These equations use a set of articulator weightings which specify the relative contributions of the component articulators of the tract variable to a given gesture. These weightings comprise an additional set of parameters in the task-dynamic model. They are gesture-specific, and so are included in the gestural score. The gestural score thus specifies all the gestural parameters: the equilibrium or target position, x₀; the stiffness, k; the damping ratio that, together with the stiffness, determines the damping factor b; and the articulator weightings. It also specifies how successive gestures are coordinated in time.

As mentioned above, the issue of how successive gestures are coordinated in time is difficult and is currently being worked on (Saltzman and Munhall 1989). A strategy that has been used is to specify the phase in one gesture with respect to which a second gesture is coordinated. The
definition of phase for these purposes has involved the concept of phase space - a two-dimensional space in which velocity and displacement are the coordinates (Kelso and Tuller 1984). Phase space allows a phase to be assigned to any kind of movement, but for movements that are essentially sinusoidal, phase has its traditional meaning. So, for example, starting a second gesture at 180 degrees in the sinusoidal movement cycle of a first gesture would mean that the second one began when the first had just completed half of its full cycle. But critically damped movements do not lend themselves well to this kind of strategy: under most conditions, they never reach a phase of even 90 degrees, as defined in phase space.

Browman and Goldstein's present solution to this problem uses the familiar notion of phase of a sinusoidal movement, but in an unusual way. They assume that each critically damped gesture can also be described in terms of a cycle of an underlying undamped sinusoid. The period of this undamped cycle is calculated from the stiffness associated with the particular gesture; it represents the underlying natural frequency of the gesture, whose realization is critically damped. Two gestures are coordinated in time by specifying a phase in the underlying undamped cycle for each one, and then making those (underlying) phases coincide in time. For example, two gestures might be coordinated such that the point in one gesture that is represented by an underlying phase of 180 degrees coincided in time with the point represented by an underlying phase of 240 degrees in the other. This approach differs from the use of phase relationships described above in that phase in a gesture does not depend on the actual movement associated with the gesture, but rather on the stiffness (i.e. underlying natural frequency) for that gesture. This approach, together with an illustration of critical damping, is described in Browman and Goldstein (1990: 346-8).

Coordinating gestures in terms of underlying phases means that the timing of each gesture is specified intrinsically rather than by using an external clock to time movements. Changing the phases specified by the gestural score can therefore affect the number of gestures per unit time, but not the rate at which each gesture is made. If no changes in stiffness are introduced, then changing the phase specifications will change the amount of overlap between gestures. In speech, this will affect the amount of coarticulation, or coproduction as it is often called (Fowler 1980), which can affect the style of speech (e.g. the degree of casualness) and, indirectly, the overall rate of speech.

The discussion so far has described how movement starts - by the gestural score feeding the task-dynamic model with information about gestural targets, stiffnesses, and relative timing - but what causes a movement to stop has not been mentioned. There is no direct command
given to stop the movement resulting from a gesture. The gestural score specifies that a gesture is either activated, in which case controlled movement towards the target is initiated, or else not activated. When a gesture is no longer activated, the movements of the articulators involved are governed by either of two factors: first, an articulator may participate in a subsequent gesture; second, each articulator has its own inherent rest position, described as a neutral attractor, and moves "passively" towards this rest position whenever it is not involved in an "actively" controlled tract variable that is participating in a gesture. This rest position should not be confused with the resting or equilibrium position that is the target of an actively controlled tract variable and is specified in the gestural score. The inherent rest position is specific to an articulator, not a tract variable, and is specified by standard equations in the task-dynamic model. It may be language-specific (Saltzman and Munhall 1989) and hence correspond to the "base of articulation" - schwa for English - in which case it seems possible that it could also contribute towards articulatory setting and thus be specific to a regional accent or an individual.

A factor that has not yet been mentioned is how concurrent gestures combine. The gestural score specifies the periods when the gestures are activated. The task-dynamic model governs how the action of the articulators is coordinated within a single gesture, making use of the articulator weightings. When two or more gestures are concurrently active, they may share a tract variable, or they may involve different tract variables but affect a common articulator. In both cases, the influences of the overlapping gestures are said to be blended. For blending within a shared tract variable, the parameters associated with each gesture are combined either by simple averaging, weighted averaging, or addition. (See Saltzman and Munhall [1989] for more detailed discussion of gestural blending both within and across tract variables.)
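As a rough illustration of the last few paragraphs, the sketch below lets a toy "gestural score" drive a single tract variable: overlapping gestures are blended by weighted averaging (one of the options just mentioned), and the variable relaxes toward a stand-in neutral value when no gesture is active. The score, all numerical values, and the simplification of putting the neutral attractor on the tract variable rather than on individual articulators are assumptions made for illustration, not the model's actual implementation.

```python
# Toy gestural score driving one tract variable. All values are invented;
# this is a schematic illustration, not the task-dynamic implementation.

NEUTRAL_TARGET, NEUTRAL_K = 5.0, 20.0    # stand-in for a neutral attractor

# Each gesture: (onset s, offset s, target, stiffness k, blending weight)
SCORE = [
    (0.00, 0.25, 0.0, 400.0, 1.0),   # e.g. a closing gesture
    (0.20, 0.50, 8.0, 150.0, 1.0),   # an overlapping opening gesture
]

def blended_parameters(t):
    """Blend target and stiffness over all gestures active at time t by
    weighted averaging; fall back to the neutral values otherwise."""
    active = [(tgt, k, w) for on, off, tgt, k, w in SCORE if on <= t < off]
    if not active:
        return NEUTRAL_TARGET, NEUTRAL_K
    total = sum(w for _, _, w in active)
    target = sum(w * tgt for tgt, _, w in active) / total
    stiffness = sum(w * k for _, k, w in active) / total
    return target, stiffness

def simulate(x=10.0, dt=0.001, t_max=0.6, m=1.0):
    """Integrate the critically damped point-attractor equation while the
    score switches (and blends) targets and stiffnesses over time."""
    v, trajectory, t = 0.0, [], 0.0
    while t < t_max:
        x0, k = blended_parameters(t)
        b = 2.0 * (m * k) ** 0.5              # keep critical damping
        v += (-(b * v + k * (x - x0)) / m) * dt
        x += v * dt
        trajectory.append((round(t, 3), x))
        t += dt
    return trajectory

traj = simulate()
print(traj[220])   # during the overlap: parameters are a blend of both gestures
print(traj[-1])    # after all activations end: relaxing toward the neutral value
```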
1.4 Evaluative remarks

Given the purpose of this paper, a lengthy critique is not appropriate, but since the description so far has been uncritical, some brief evaluative comments may be helpful. A particularly rich source of evaluation is the theme issue of the Journal of Phonetics (1986, 14 [1]) dedicated to event perception and action theory, although there have since been changes to task-dynamic theory, particularly as it applies to speech. For a more recent account, see Saltzman and Munhall (1989).

The damped mass-spring model is attractive because it is simple, and general enough to be applicable to any form of movement. Its use
represents a significant step forward in explaining skilled movement. But in order to implement the model in a manageable way, a number of simplifications and (somewhat) ad hoc decisions have been introduced. I have a lot of sympathy with this approach, since it often allows us to pursue the important questions instead of getting bogged down by details. Eventually, however, the simplifications and ad hoc decisions have to be dealt with; it is as well not to lose sight of them, so that the model can be modified or replaced when their disadvantages outweigh their advantages.

One simplification seems to me to be that of abstract mass, and consequently the relationship between mass and stiffness, since the two can compensate for one another while both are abstract. This assumption may require eventual modification. Consider what the tongue tip-blade does to produce an alveolar trill on the one hand, and a laminal stop on the other. For the trill, the relationships among the physical mass of the tip, its biomechanical stiffness, and the aerodynamic forces are critical. For the stop, the relationships between these factors are less critical (though still important), but the tip and blade, acting now as a functional unit, must have a much greater mass than the tip can have for the trill. As far as I can see, task dynamics in its present form cannot produce a trill, and I am not sure that it can produce a laminal as opposed to an apical stop. The point is that mass is always abstract in the task-dynamic system, but for some articulations the real mass of an articulator is crucial. If it turns out that physical mass does have to be used to account for sounds like trills and apical vs. laminal stops, then it would be reasonable to reevaluate the relationship between abstract and actual physical mass.

Including physical mass as well as abstract mass will force investigators to take a close look at some of the other parameters in the movement equations. The most obvious is how stiffness will be used as a control variable. As long as stiffness and mass are both abstract, variation in stiffness can substitute for variation in mass. But if stiffness in the model comes to be related to biomechanical stiffness, as Browman and Goldstein (1989) have suggested it might be, then the value of the biomechanical mass becomes very important. Saltzman and Kelso (1987) describe for skilled limb activities how abstract and biomechanical dynamics might be related to one another.

An example of an apparently ad hoc solution to a problem is the choice of critical damping for all nonglottal gestures. There seem to be two main motivations for using critical damping: first, critical damping is a straightforward way of implementing the point-attractor equation to produce the asymptotic type of movement typical of target-directed tasks; second, it
represents a simple compromise between the faster but overshooting underdamped case, and the slower but non-overshooting overdamped case. Critical damping is straightforward to use because, as mentioned above, the damping factor b is specified indirectly via the damping ratio, b/(2√(mk)), which is constrained to equal 1.0 in a critically damped system. But since it includes the independent stiffness parameter k, then if the ratio is constrained, b is a function of k. This dependence of damping on stiffness may be reasonable for movement of a single articulator, as used for the decay to the neutral rest position of individual uncontrolled articulators, but it seems less reasonable for modeling gestures. Moreover, although the damping factor is crucially important, the choice of critical damping is not much discussed and seems relatively arbitrary in that other types of damping can achieve similar asymptotic trajectories. (Fujimura [1990: 377-81] notes that there are completely different ways of achieving the same type of asymptotic trajectories.) I would be interested to know why other choices have been rejected. One question is whether the same damping factor should be used for all gestures. Is the trajectory close to the target necessarily the same for a stop as for a fricative, for example? I doubt it.

The method of coordinating gestures by specifying relative phases poses problems for a model based on critically damped sinusoids. As mentioned above, a critically damped system does not lend itself to a description in terms of phase angles. Browman and Goldstein's current solution of mapping the critically damped trajectory against an undamped sinusoid is a straightforward empirical solution that has the advantage of being readily understandable. It is arguably justifiable since they are concerned more with developing a comprehensive theory of phonology-phonetics, for which they need an implementable model, than with developing the task-dynamic model itself. But the introduction of an additional level of abstractness in the model - the undamped gestural cycle - to explain the relative timing of gestures seems to me to need independent justification if it is to be taken seriously.

Another issue is the relationship between the control of speech movements and the vegetative activity of the vocal tract. The rest position towards which an articulator moves when it is not actively controlled in speech is unlikely to be the same as the rest position during quiet breathing, for example. Similarly, widening of the glottis to resume normal breathing after phonation is an active gesture, not a gradual movement towards a rest position. These properties could easily be included in an ad hoc way in the current model, but they raise questions of how the model should account for different task-dynamic systems acting
Gesture
on the same articulators - in this case the realization of volitional linguistic intention coordinated with automatic behavior that is governed by neural chemoreceptors and brainstem activity. The attempt to answer questions such as how the task-dynamic model accounts for different systems affecting the same articulators partly involves questions about the role of the gestural score: is the gestural score purely linguistic, and if so why; and to what extent does it (or something like it) include linguistic intentions, as opposed to implementing those intentions? One reason why these questions will be hard to answer is that they include some of the most basic issues in phonology and phonetics. The status of the neutral attractor (rest position of each articulator) is a case in point. Since the neutral attractor is separately determined for each articulator, it is controlled within the task-dynamic model in the current system. But if the various neutral attractors define the base of articulation, and if, as seems reasonable, the base of articulation is language-specific and hence is not independent of the phonology of the language, then it should be specified in the gestural score if Browman and Goldstein are correct in stating (this volume) that all language-specific phonetic/phonological structure is found there. A related issue is how the model accounts for learning - the acquisition of speech motor control. Developmental phonologists have tended to separate issues of phonological competence from motoric skill. But the role that Browman and Goldstein, for example, assign to the gestural score suggests that these authors could take a very different approach to the acquisition of phonology. The relationship between the organization of phonology during development and in the adult is almost certainly not simple, but to attempt to account for developmental phonology within the task-dynamic (or articulatory phonology) model could help clarify certain aspects of the model. It could indicate, for example, the extent to which the gestural score can reasonably be considered to encompass phonological primitives, and whether learned, articulator-specific skills like articulatory setting or base of articulation are best controlled by the gestural score or within the task-dynamic system. Browman and Goldstein (1989) have begun to address these issues. Finally, there is the question of how much the model offers explanation rather than description. The large number of variables and parameters makes it likely that some observed movement patterns can be modelled in more than one way. Can the same movement trajectory arise from different parameter values, types of blending, or activation periods and degree of gestural overlap? If this happens, how does one choose between alternatives? At the tract variable level, should there be only one possible way to model a given movement trajectory? In other words, if there are 22
several alternatives, is the diversity appropriate, or does it reduce the model's explanatory power? A related problem is that the model is as yet incomplete. This raises the issue of whether revisions will affect in important ways the conclusions derived from work with the current model. In particular, it is clear that the current set of eight or nine tract variables is likely to need revision, although those used are among the most fundamental. Browman and Goldstein (1989) name some tongue and laryngeal variables that should be added; in my discussion of their paper in this volume, I suggest that aerodynamic variables should also eventually be included. Aerodynamic control may be difficult to integrate into the task-dynamic model, which is presently couched entirely in terms of the degree and location of constrictions. More importantly, control of aerodynamics involves articulatory as well as respiratory factors and therefore may influence articulatory movement patterns. It is an open question whether including such new variables will invalidate any of the present results. The current set of variables seems to capture important properties of speech production, and it seems reasonable to assume that many of the insights obtained by using them are valid. But as the questions addressed with the model become more detailed, it becomes important to remember that the model is not yet complete and that it will be revised. Any model of a poorly understood complex system can be faulted for oversimplifying and for ad hoc decisions. The important question at this stage is what task dynamics has achieved. It is too early to give a final answer, but there is no question that task dynamics is making significant contributions. It sets out to account systematically for observed characteristics of coordinated movement. To do so, it asks what the organizing principles of coordinated movement are, framing its answers in dynamical rather than physiological terms. Its explicit mathematical basis means that it is testable, and its generality makes it applicable to any form of skilled movement. Although future models might be connected more closely to physiology, a useful first step is to concentrate on the basic organizational principles - on the dynamics governing the functional groupings of coordinated articulator movement. In the analysis of speech, task dynamics unifies the traditional issues of coarticulation, speech rate, and speech style into a single framework. By providing a vocabulary and a mechanism for investigating the similarities, the differences are brought into sharper focus. This same vocabulary and framework promises a fresh approach to the discussion of linguistic units. In the recent work of Browman and Goldstein and of Saltzman and Munhall, for example, we find welcome unification and operationalization of terms in phonology on the one hand and phonetics on the other.
Whether or not the approach embodied in task dynamics becomes generally accepted, the debate on the issues it raises promises to be both lively and valuable.

Concluding remarks
Task dynamics offers a systematic, general account of the control of skilled movement. As befits the data, it is a complicated system involving many parameters and variables. In consequence, there is tension between the need to explore the system itself, and the need to use it to explore problems within (in our case) phonology and phonetics. I restrict these final comments to the application of task dynamics to phonological and phonetic processes of speech production, rather than to details of the execution of skilled movement in general. As a system in itself, I am not convinced that task dynamics will solve traditional problems of phonetics and phonology. It is not clear, for example, that its solutions for linguistic intentions, linguistic units, and the invariance-variability issue will prove more satisfactory than other solutions. Moreover, there are arguments for seeking a model of speech motor control that is couched in physiological rather than dynamic terms, and that accounts for the learning of skilled movements as well as for their execution once they are established. Nevertheless, partly because it is nonspecific, ambitious enough to address wide-ranging issues, and explicit enough to allow alternative solutions to be tried within the system, task dynamics is well worth developing because it brings these issues into focus. The connections it draws between the basic organizing principles of skilled movement and potential linguistic units raise especially interesting questions. It may be superseded by other models, but those future models are likely to owe some of their properties to research within the task-dynamic approach. Amongst the attributes I find particularly attractive are the emphasis on speech production as a dynamic process, and the treatment of coarticulation, rephrased as coproduction, as an inherent property of gestural dynamics, so that changes in rate and style require relatively simple changes in global parameter values, rather than demanding new targets and computation of new trajectories for each new type of utterance. It is easier to evaluate the contribution of task dynamics in exploring general problems of phonology and phonetics. The fact that the task-dynamic model is used so effectively testifies to its value. Browman and Goldstein, for example, use the model to synthesize particular speech patterns, and then use the principles embodied in the model to draw out the implications for the organization of phonetics and phonology. But
beyond the need to get the right general effects, it seems to me that the details of the implementation are not very important in this approach. The value of the model in addressing questions like those posed by Browman and Goldstein is therefore as much in its explicitness and relative ease of use as in its details. The point, then, at this early stage, is that it does not really matter if particular details of the task dynamics are wrong. The value of the task-dynamic model is that it enables a diverse set of problems in phonology and phonetics to be studied within one framework in a way that has not been done before. The excitement in this work is that it offers the promise of new ways of thinking about phonetic and phonological theory. Insofar as task dynamics allows description of diverse phenomena in terms of general physical laws, it provides insights that are as near as we can currently get to explanation.
2 "Targetless" schwa: an articulatory analysis
CATHERINE P. BROWMAN and LOUIS GOLDSTEIN
2.1 Introduction
One of the major goals for a theory of phonetic and phonological structure is to be able to account for the (apparent) contextual variation of phonological units in as general and simple a way as possible.* While it is always possible to state some pattern of variation using a special "low-level" rule that changes the specification of some unit, recent approaches have attempted to avoid stipulating such rules, and instead propose that variation is often the consequence of how the phonological units, properly defined, are organized. Two types of organization have been suggested that lead to the natural emergence of certain types of variation: one is that invariantly specified phonetic units may overlap in time, i.e., they may be coproduced (e.g., Fowler 1977, 1981a; Bell-Berti and Harris 1981; Liberman and Mattingly 1985; Browman and Goldstein 1990), so that the overall tract shape and acoustic consequences of these coproduced units will reflect their combined influence; a second is that a given phonetic unit may be unspecified for some dimension(s) (e.g., Ohman 1966b; Keating 1988a), so that the apparent variation along that dimension is due to continuous trajectories between neighboring units' specifications for that dimension.

*Our thanks to Ailbhe Ni Chasaide, Carol Fowler, and Doug Whalen for criticizing versions of this paper. This work was supported by NSF grant BNS 8820099 and NIH grants HD-01994 and NS-13617 to Haskins Laboratories.

A particularly interesting case of contextual variation involves reduced (schwa) vowels in English. Investigations have shown that these vowels are particularly malleable: they take on the acoustic (Fowler 1981a) and articulatory (e.g., Alfonso and Baer 1982) properties of neighboring vowels. While Fowler (1981a) has analyzed this variation as emerging from the coproduction of the reduced vowels and a neighboring stressed vowel, it might also be the case that schwa is completely unspecified for tongue position. This would be consistent with analyses of formant trajectories for medial schwa in trisyllabic sequences (Magen 1989) that have shown that F2 moves (roughly continuously) from a value dominated by the preceding vowel (at onset) to one dominated by the following vowel (at offset). Such an analysis would also be consistent with the phonological analysis of schwa in French (Anderson 1982) as an empty nucleus slot. It is possible (although this is not Anderson's analysis) that the empty nucleus is never filled in by any specification, but rather there is a specified "interval" of time between two full vowels in which the tongue continuously moves from one vowel to another. The computational gestural model being developed at Haskins Laboratories (e.g. Browman et al. 1986; Browman and Goldstein, 1990; Saltzman et al. 1988a) can serve as a useful vehicle for testing these (and other) hypotheses about the phonetic/phonological structure of utterances with such reduced schwa vowels. As we will see, it is possible to provide a simple, abstract representation of such utterances in terms of gestures and their organization that can yield the variable patterns of articulatory behavior and acoustic consequences that are observed in these utterances. The basic phonetic/phonological unit within our model is the gesture, which involves the formation (and release) of a linguistically significant constriction within a particular vocal-tract subsystem. Each gesture is modeled as a dynamical system (or set of systems) that regulates the time-varying coordination of individual articulators in performing these constriction tasks (Saltzman 1986). The dimensions along which the vocal-tract goals for constrictions can be specified are called tract variables, and are shown in the left-hand column of figure 2.1. Oral constriction gestures are defined in terms of pairs of these tract variables, one for constriction location, one for constriction degree. The right-hand side of the figure shows the individual articulatory variables whose motions contribute to the corresponding tract variable. The computational system sketched in figure 2.2 (Browman and Goldstein, 1990; Saltzman et al. 1988a) provides a representation for arbitrary (English) input utterances in terms of such gestural units and their organization over time, called the gestural score. The layout of the gestural score is based on the principles of intergestural phasing (Browman and Goldstein 1990) specified in the linguistic gestural model. The gestural score is input to the task-dynamic model (Saltzman 1986; Saltzman and Kelso 1987), which calculates the patterns of articulator motion that result from the set of active gestural units. The articulatory movements produced by the task-dynamic model are then input to an articulatory synthesizer (Rubin, Baer, and Mermelstein 1981) to calculate an output speech waveform.
       Tract variable                          Articulators involved
LP     lip protrusion                          upper and lower lips, jaw
LA     lip aperture                            upper and lower lips, jaw
TTCL   tongue-tip constriction location        tongue-tip, tongue-body, jaw
TTCD   tongue-tip constriction degree          tongue-tip, tongue-body, jaw
TBCL   tongue-body constriction location       tongue-body, jaw
TBCD   tongue-body constriction degree         tongue-body, jaw
VEL    velic aperture                          velum
GLO    glottal aperture                        glottis

Figure 2.1 Tract variables and associated articulators
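As a rough illustration of the information carried by one entry in a gestural score, the following sketch uses our own field names and made-up values, not the actual GEST data structures or dictionary entries. Each gesture pairs a tract variable with the dynamic parameters of its control regime and an activation interval.

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    """One gestural-score entry (illustrative encoding only)."""
    tract_variable: str    # e.g. "LA", "TBCL", "TBCD", "GLO"
    target: float          # equilibrium position of the control regime
    stiffness: float       # abstract stiffness of the second-order system
    damping_ratio: float   # 1.0 = critically damped (nonglottal gestures)
    onset_ms: float        # start of the activation interval
    offset_ms: float       # end of the activation interval

# A bilabial closure gesture overlapping a vowel gesture, roughly as in /pam/;
# all numbers are hypothetical.
closure = Gesture("LA", target=-2.0, stiffness=8.0, damping_ratio=1.0,
                  onset_ms=0.0, offset_ms=120.0)
vowel = Gesture("TBCD", target=10.5, stiffness=2.0, damping_ratio=1.0,
                onset_ms=60.0, offset_ms=360.0)
```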
The operation of the task-dynamic model is assumed to be "universal." (In fact, it is not even specific to speech, having originally been developed [Saltzman and Kelso 1987] to describe coordinated reaching movements.) Thus, all of the language-particular phonetic/phonological structure must reside in the gestural score - in the dynamic parameter values of individual gestures, or in their relative timing. Given this constraint, it is possible to test the adequacy of some particular hypothesis about phonetic structure, as embodied in a particular gestural score, by using the model to generate the articulatory motions and comparing these to observed articulatory data. The computational model can thus be seen as a tool for evaluating the articulatory (and acoustic) consequences of hypothesized aspects of gestural structure. In particular, it is well suited for evaluating the consequences of the organizational properties discussed above: (1) underspecification and (2) temporal overlap. Gestural structures are inherently underspecified in the sense that there are intervals of time during which the value of a given tract variable is not being controlled by the system; only when a gesture defined along that tract variable is active is such control in place.
Figure 2.2 Overview of GEST: gestural computational model (components: intended utterance, linguistic gestural model, task-dynamic model, articulatory synthesizer, output speech)
This underspecification can be seen in figure 2.3, which shows the gestural score for the utterance /pam/. Here, the shaded boxes indicate the gestures, and are superimposed on the tract-variable time functions produced when the gestural score is input to the task-dynamic model. The horizontal dimension of the shaded boxes indicates the intervals of time during which each of the gestural units is active, while the height of the boxes corresponds to the "target" or equilibrium position parameter of a given gesture's dynamical control regime. See Hawkins (this volume) for a more complete description of the model and its parameters. Note that during the activation interval of the initial bilabial closure gesture, Lip Aperture (LA - vertical distance between the two lips) gradually decreases, until it approaches the regime's target. However, even after the regime is turned off, LA shows changes over time. Such "passive" tract-variable changes result from two sources: (1) the participation of one of the (uncontrolled) tract variable's articulators in some other tract variable which is under active gestural control, and (2) an articulator-specific "neutral" or "rest" regime, that takes control of any articulator which is not currently active in any gesture. For example, in the LA case shown here, the jaw contributes to the Tongue-Body constriction degree (TBCD) gesture (for the vowel) by lowering, and this has the side effect of increasing LA. In addition, the upper and lower lips are not involved in any active gesture, and so move towards their neutral positions with respect to the upper and lower teeth, thus further contributing to an increase in LA. Thus, the geometric structure of the model itself (together with the set of articulator-neutral values) predicts a specific, well-behaved time function for a given tract variable, even when it is not being controlled.
Figure 2.3 Gestural score and generated motion variables for /pam/ (panels: velic aperture, tongue-body constriction degree, lip aperture, glottal aperture). The input is specified in ARPAbet, so /pam/ = ARPAbet input string /paam/. Within each panel, the height of the box indicates degree of opening (aperture) of the relevant constriction: the higher the curve (or box) the greater the amount of opening
Uncontrolled behavior need not be stipulated in any way. This feature of the model is important for testing the hypothesis that schwa may not involve an active gesture at all. The second useful aspect of the model is the ability to predict consequences of the temporal overlap of gestures, i.e., intervals during which there is more than one concurrently active gesture. Browman and Goldstein (1990) have shown that the model predicts different consequences of temporal overlap, depending on whether the overlapping gestures involve the same or different tract variables, and that these different consequences can actually be observed in allophonic variations and "casual speech" alternations. Of particular importance to analyzing schwa is the shared tract variable case, since we will be interested in the effects of overlap between an active schwa gesture (if any) and the preceding or following vowel gesture, all of which would involve the Tongue-Body tract variables (TBCD, and Tongue-Body constriction location - TBCL). In this case, the dynamic parameter values for the overlapping gestures are "blended," according to a competitive blending dynamics (Saltzman et al. 1988a; Saltzman and Munhall 1989). In the examples we will be examining, the blending will have the effect of averaging the parameter values. Thus, if both gestures were coextensive for their entire activation intervals, neither target value would be achieved; rather, the value of the tract variable at the end would be the average of their targets.
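A minimal sketch of this special case of blending follows. With equal weights it reduces to the simple averaging described above; the actual model uses a competitive blending dynamics rather than this closed-form shortcut, and the target values below are arbitrary illustrative numbers.

```python
def blended_target(targets, weights=None):
    """Effective equilibrium position when several gestures are simultaneously
    active on the same tract variable (equal weights = simple averaging)."""
    if weights is None:
        weights = [1.0] * len(targets)
    return sum(w * t for w, t in zip(weights, targets)) / sum(weights)

# Hypothetical tongue-body constriction-degree targets for schwa and V2:
schwa, v2 = 9.0, 12.0
print(blended_target([schwa, v2]))          # 10.5 -- the average, not either target
print(blended_target([schwa, v2], [1, 3]))  # 11.25 -- an unequal-weight blend
```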
In this paper, our strategy is to analyze movements of the tongue in utterances with schwa to determine if the patterns observed provide evidence for a specific schwa tongue target. Based on this analysis, specific hypotheses about the gestural overlap in utterances with schwa are then tested by means of computer simulations using the gestural model described above.

2.2 Analysis of articulatory data
Using data from the Tokyo X-ray archive (Miller and Fujimura 1982), we analyzed /pV1pə'pV2pə/ utterances produced by a speaker of American English, where V1 and V2 were all possible combinations of /i, e, a, ʌ, u/. Utterances were read in short lists of seven or eight items, each of which had the same V1 and different V2s. One token (never the initial or final item in a list) of each of the twenty-five utterance types was analyzed. The microbeam data tracks the motion of five pellets in the mid-sagittal plane. Pellets were located on the lower lip (L), the lower incisor for jaw movement (J), and the midline of the tongue: one approximately at the tongue blade (B), one at the middle of the tongue dorsum (M), and one at the rear of the tongue dorsum (R). Ideally, we would use the information in tongue-pellet trajectories to infer a time-varying representation of the tongue in terms of the dimensions in which vowel-gesture targets are defined, e.g., for our model (or for Wood 1982), location and degree of tongue-body constriction. (For Ladefoged and Lindau [1989], the specifications would be rather in terms of formant frequencies linked to the factors of front-raising and back-raising for the tongue.) Since this kind of transformation cannot currently be performed with confidence, we decided to describe the vowels directly in terms of the tongue-pellet positions. As the tongue-blade pellet (B) was observed to be largely redundant in these utterances (not surprising, since they involve only vowels and bilabial consonants), we chose to measure the horizontal (X) and vertical (Y) positions for the M and R pellets for each vowel. While not ideal, the procedure at least restricts its a priori assumption about the parameterization of the tongue shape to that inherent in the measurement technique. The first step was to find appropriate time points at which to measure the position of the pellets for each vowel. The time course of each tongue-pellet dimension (MX, MY, RX, RY) was analyzed by means of an algorithm that detected displacement extrema (peaks and valleys). To the extent that there is a characteristic pellet value associated with a given vowel, we may expect to see such a displacement extremum, that is, movement towards some value, then away again. The algorithm employed a noise level of one X-ray grid unit
(approximately 0.33 mm); thus, movements of a single unit in one direction and back again did not constitute extrema. Only the interval that included the full vowels and the medial schwa was analyzed; final schwas were not analyzed. In general, an extremum was found that coincided with each full vowel, for each pellet dimension, while such an extremum was missing for schwa in over half the cases. The pellet positions at these extrema were used as the basic measurements for each vowel. In cases where a particular pellet dimension had no extremum associated with a vowel, a reference point was chosen that corresponded to the time of an extremum of one of the other pellets. In general, MY was the source of these reference points for full vowels, and RY was the source for schwa, as these were dimensions that showed the fewest missing extrema. After the application of this algorithm, each vowel in each utterance was categorized by the value at a single reference point for each of the four pellet dimensions. Since points were chosen by looking only at data from the tongue pellets themselves, these are referred to as the "tongue" reference points. To illustrate this procedure, figure 2.4a shows the time courses of the M, R, and L pellets (only vertical for L) for the utterance /pipə'pipə/ with the extrema marked with dashed lines. The acoustic waveform is displayed at the top. For orientation, note that there are four displacement peaks marked for LY, corresponding to the raising of the lower lip for the four bilabial-closure gestures for the consonants. Between these peaks three valleys are marked, corresponding to the opening of the lips for the three vowels. For MX, MY, and RX, an extremum was found associated with each of the full vowels and the medial schwa. For RY, a peak was found for schwa, but not for V1. While there is a valley detected following the peak for schwa, it occurs during the consonant closure interval, and therefore is not treated as associated with V2. Figure 2.4b shows the same utterance with the complete set of "tongue" reference points used to characterize each vowel. Reference points that have been copied from other pellets (MY in both cases) are shown as solid lines. Note that the consonant-closure interval extremum has been deleted. Figure 2.5 shows the same displays for the utterance /pipə'papə/. Note that, in (a), extrema are missing for schwa for MX, MY, and RX. This is typical of cases in which there is a large pellet displacement between V1 and V2. The trajectory associated with such a displacement moves from V1 to V2, with no intervening extremum (or even, in some cases, no "flattening" of the curve). As can be seen in figures 2.4 and 2.5, the reference points during the schwa tend to be relatively late in its acoustic duration. As we will be evaluating the relative contributions of V1 and V2 in determining the pellet positions for schwa, we decided also to use a reference point earlier in the schwa.
Figure 2.4 Pellet time traces for /pipə'pipə/ (waveform; tongue middle horizontal, MX; tongue middle vertical, MY; tongue rear horizontal, RX; tongue rear vertical, RY; lower lip vertical, LY). The higher the trace, the higher (vertical) or more fronted (horizontal) the corresponding movement. (a) Position extrema indicated by dashed lines. (b) "Tongue" reference points indicated by dashed and solid lines (for Middle and Rear pellets)
Figure 2.5 Pellet time traces for /pipə'papə/ (same traces as in figure 2.4). The higher the trace, the higher (vertical) or more fronted (horizontal) the corresponding movement. (a) Position extrema indicated by dashed lines. (b) "Tongue" reference points indicated by dashed and solid lines (for Middle and Rear pellets)
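The extremum-finding procedure illustrated in figures 2.4 and 2.5 (movement toward a value and away again, with single-unit jitter ignored) might be sketched roughly as follows. This is our reconstruction of the kind of algorithm described above, not the authors' original code, and the toy trajectory is invented for illustration.

```python
def displacement_extrema(track, noise=1.0):
    """Return indices of displacement extrema (peaks and valleys) in one pellet
    dimension, ignoring reversals no larger than `noise` (one X-ray grid unit)."""
    extrema = []
    anchor_i, anchor_v = 0, track[0]   # most extreme point since the last reversal
    direction = 0                      # +1 rising, -1 falling, 0 undecided
    for i, v in enumerate(track[1:], start=1):
        if direction >= 0 and v > anchor_v:
            anchor_i, anchor_v, direction = i, v, +1
        elif direction <= 0 and v < anchor_v:
            anchor_i, anchor_v, direction = i, v, -1
        elif direction == +1 and anchor_v - v > noise:   # fell away from a peak
            extrema.append(anchor_i)
            anchor_i, anchor_v, direction = i, v, -1
        elif direction == -1 and v - anchor_v > noise:   # rose away from a valley
            extrema.append(anchor_i)
            anchor_i, anchor_v, direction = i, v, +1
    return extrema

# Toy trajectory: rise for V1, dip for schwa, rise again for V2, with jitter.
trace = [0, 2, 5, 8, 9, 8.7, 9, 6, 4, 3.6, 4, 7, 10, 11, 10.8]
print(displacement_extrema(trace))   # [4, 9]: the V1 peak and the schwa valley
```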
To obtain such a point, we used the valley associated with the lower lip for the schwa - that is, approximately the point at which the lip opening is maximal. This point, called the "lip" reference, typically occurs earlier in the (acoustic) vowel duration than the "tongue" reference point, as can be seen in figures 2.4 and 2.5. Another advantage of the "lip" reference point is that all tongue pellets are measured at the same moment in time. Choosing points at different times for different dimensions might result in an apparent differential influence of V1 and V2 across dimensions. Two different reference points were established only for the schwa, and not for the full vowels. That is, since the full vowels provided possible environmental influences on the schwa, the measure of that influence needed to be constant for comparisons of the "lip" and "tongue" schwa points. Therefore, in analyses to follow, when "lip" and "tongue" reference points are compared, these points differ only for the schwa. In all cases, full vowel reference points are those determined using the tongue extremum algorithm described above.

2.2.1 Results

Figure 2.6 shows the positions of the M (on the right) and R (on the left) pellets for the full vowels plotted in the mid-sagittal plane such that the speaker is assumed to be facing to the right. The ten points for a given vowel are enclosed in an ellipse indicating their principal components (two standard deviations along each axis). The tongue shapes implied by these pellet positions are consistent with cinefluorographic data for English vowels (e.g., Perkell 1969; Harshman, Ladefoged, and Goldstein, 1977; Nearey 1980). For example, /i/ is known to involve a shape in which the front of the tongue is bunched forward and up towards the hard palate, compared, for example, to /e/, which has a relatively unconstricted shape. This fronting can be seen in both pellets. In fact, over all vowels, the horizontal components of the motion of the two pellets are highly correlated (r = 0.939 in the full vowel data, between RX and MX over the twenty-five utterances). The raising for /i/ can be seen in M (on the right), but not in R, for which /i/ is low - lower, for example, than /a/. The low position of the back of the tongue dorsum for /i/ can, in fact, be seen in mid-sagittal cinefluorographic data. Superimposed tongue surfaces for different English vowels (e.g. Ladefoged 1982) reveal that the curves for /i/ and /a/ cross somewhere in the upper pharyngeal region, so that in front of this point, /i/ is higher than /a/, while behind this point, /a/ is higher. This suggests that the R pellet in the current experiment is far enough back to be behind this cross-over point. /u/ involves raising of the rear of the tongue dorsum (toward the soft palate), which is here reflected in the raising of both the R and M pellets. In general, the vertical components of the two pellets are uncorrelated across the set of vowels as a whole (r = 0.020), reflecting, perhaps, the operation of two independent factors such as "front-raising" and "back-raising" (Ladefoged 1980).
Figure 2.6 Pellet positions for full vowels, displayed in the mid-sagittal plane with the head facing to the right: Middle pellets on the right, Rear pellets on the left. The ellipses indicate two standard deviations along axes determined by principal-component analysis. Symbols I = IPA /i/, U = /u/, E = /e/, X = /ʌ/, and A = /a/. Units are X-ray units (= 0.33 mm)
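The two-standard-deviation principal-component ellipses used in figures 2.6-2.8 can be computed along the following lines. This is a sketch of the convention stated in the caption, with made-up pellet positions, not the authors' plotting code.

```python
import numpy as np

def pc_ellipse(points, n_sd=2.0):
    """Mean, principal-axis directions, and semi-axis lengths (at n_sd standard
    deviations) for a 2-D scatter of pellet positions."""
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)
    cov = np.cov(pts.T)                      # 2 x 2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # principal axes of the scatter
    half_lengths = n_sd * np.sqrt(eigvals)   # n_sd standard deviations per axis
    return mean, eigvecs, half_lengths

# Ten invented (x, y) positions for one vowel, in X-ray grid units:
tokens = [(150, 262), (152, 265), (149, 263), (153, 266), (151, 264),
          (148, 261), (154, 267), (150, 265), (152, 262), (151, 263)]
mean, axes, halves = pc_ellipse(tokens)
print(mean.round(1), halves.round(2))
```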
The pellet positions for schwa, using the "tongue" reference points, are shown in the same mid-sagittal plane in figure 2.7, with the full vowel ellipses added for reference. The points are labeled by the identity of the following vowel (V2) in (a) and by the preceding vowel (V1) in (b). Figure 2.8 shows the parallel figure for schwa measurements at the "lip" reference point. In both figures, note that the range of variation for schwa is less than the range of variation across the entire vowel space, but greater than the variation for any single full vowel. Variation in MY is particularly large compared to MY variation for any full vowel. Also, while the distribution of the R pellet positions appears to center around the value for unreduced /ʌ/, which might be thought to be a target for schwa, this is clearly not the case for the M pellet, where the schwa values seem to center around the region just above /e/. For both pellets, the schwa values are found in the center of the region occupied by the full vowels. In fact, this relationship turns out to be quite precise. Figure 2.9 shows the mean pellet positions for each full vowel and for schwa ("lip" and "tongue" reference points give the same overall means), as well as the grand mean of pellet positions across all full vowels, marked by a circle. The mean pellet positions for the schwa lie almost exactly on top of the grand mean for both the M and R pellets.
Figure 2.7 Pellet positions for schwa at "tongue" reference points, displayed in the right-facing mid-sagittal plane as in figure 2.6. The ellipses are from the full vowels (figure 2.6), for comparison. Symbols I = IPA /i/, U = /u/, E = /e/, X = /ʌ/, and A = /a/. Units are X-ray units (= 0.33 mm). (a) Schwa pellet positions labeled by the identity of the following vowel (V2). (b) Schwa pellet positions labeled by the identity of the preceding vowel (V1)
This pattern of distribution of schwa points is exactly what would be expected if there were no independent target for schwa but rather a continuous tongue trajectory from V1 to V2.
Figure 2.8 Pellet positions for schwa at "lip" reference points, displayed as in figure 2.7 (including ellipses from figure 2.6). Symbols I = IPA /i/, U = /u/, E = /e/, X = /ʌ/, and A = /a/. Units are X-ray units (= 0.33 mm). (a) Schwa pellet positions labeled by the identity of the following vowel (V2). (b) Schwa pellet positions labeled by the identity of the preceding vowel (V1)
Given all possible combinations of trajectory endpoints (V1 and V2), we would expect the mean value of a point located at (roughly) the midpoint of these twenty-five trajectories to have the same value as the mean of the endpoints themselves.
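This expectation is a simple averaging identity; as a worked restatement of the reasoning (not an equation from the original), if the schwa point on a given pellet dimension were just the midpoint of the V1-to-V2 trajectory, then averaging over all twenty-five ordered combinations of the five vowels gives

$$\frac{1}{25}\sum_{i=1}^{5}\sum_{j=1}^{5}\frac{x_{V_i}+x_{V_j}}{2}
   \;=\;\frac{1}{2}\left(\frac{1}{5}\sum_{i=1}^{5}x_{V_i}\;+\;\frac{1}{5}\sum_{j=1}^{5}x_{V_j}\right)
   \;=\;\bar{x},$$

where $x_{V_i}$ is the pellet value for full vowel $V_i$ and $\bar{x}$ is the grand mean over the five full vowels.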
Figure 2.9 Mean pellet positions for full vowels and schwa, displayed in the right-facing mid-sagittal plane as in figure 2.6. The grand mean of all the full vowels is indicated by a circled square. Units are X-ray units (= 0.33 mm)
If it is indeed the case that the schwa can be described as a targetless point on the continuous trajectory from V1 to V2, then we would expect that the schwa pellet positions could be predicted from knowledge of V1 and V2 positions alone, with no independent contribution of schwa. To test this, we performed stepwise multiple linear regression analyses on all possible subsets of the predictors V1 position, V2 position, and an independent schwa factor, to determine which (linear) combinations of these three predictors best predicted the position of a given pellet dimension during schwa. The analysis finds the values of the b coefficients and the constant k, in equations like (1) below, that give the best prediction of the actual schwa values.

(1) schwa(predicted) = b1*V1 + b2*V2 + k

The stepwise procedure means that variables are added into an equation such as (1) one at a time, in the order of their importance to the prediction. The procedure was done separately for equations with and without the constant term k (using BMDP2R and 9R). For those analyses containing the constant term (which is the y-intercept), k represents an independent schwa contribution to the pellet position - when it is the only predictor term, it is the mean for schwa. Those analyses without the constant term (performed using 9R) enabled the contributions of V1 and V2 to be determined in the absence of this schwa component. Analyses were performed separately for each pellet dimension, and for the "tongue" and "lip" reference points.
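The logic of the subset comparison can be sketched as follows. This is a rough stand-in for the BMDP stepwise analyses, written with ordinary least squares; the exact error definition and stepwise mechanics in the original may differ, and the input numbers below are fabricated purely to show the call, not the X-ray data.

```python
import itertools
import numpy as np

def subset_fits(v1, v2, schwa):
    """Least-squares fits of schwa pellet position on every subset of
    {V1, V2, constant k}, ranked by residual standard error."""
    cols = {"v1": np.asarray(v1, float),
            "v2": np.asarray(v2, float),
            "k": np.ones(len(schwa))}
    y = np.asarray(schwa, float)
    results = []
    for r in (1, 2, 3):
        for names in itertools.combinations(cols, r):
            X = np.column_stack([cols[n] for n in names])
            coef, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ coef
            se = float(np.sqrt(resid @ resid / len(y)))
            results.append((" + ".join(names), round(se, 2)))
    return sorted(results, key=lambda t: t[1])

v1 = [150, 160, 170, 180, 190]       # illustrative values only
v2 = [155, 150, 185, 175, 165]
schwa = [158, 157, 181, 178, 172]
for terms, se in subset_fits(v1, v2, schwa):
    print(terms, se)
```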
The results for the "tongue" points are shown in the right-hand columns of table 2.1. For each pellet, the various combinations of terms included in the equation are rank-ordered according to the standard error of the schwa prediction for that combination, the smallest error shown at the top. In all cases, the equation with all three terms gave the least error (which is necessarily true). Interestingly, however, for MX, RX, and RY, the prediction using the constant and V2 differed only trivially from that using all three variables. This indicates that, for these pellets, V1 does not contribute substantially to schwa pellet positions at the "tongue" reference point. In addition, it indicates that an independent schwa component is important to the prediction, because V2 alone, or in combination with V1, gives worse prediction than V2 plus k. For MY, all three terms seem to be important - removing any one of them increases the error. Moreover, the second-best prediction involves deleting the V2 term, rather than V1. The reduced efficacy of V2 (and increased efficacy of V1) in predicting the MY value of schwa may be due, in part, to the peak determination algorithm employed. When V1 or V2 was /a/ or /ʌ/, the criteria selected a point for MY that tended to be much later in the vowel than the point chosen for the other pellet dimensions (figure 2.5b gives an example of this). Thus, for V2, the point chosen for MY is much further in time from the schwa point than is the case for the other dimensions, while for V1, the point chosen is often closer in time to the schwa point. The overall pattern of results can be seen graphically in figure 2.10. Each panel shows the relation between "tongue" pellet positions for schwa and the full vowels: V1 in the top row and V2 in the bottom row, with a different pellet represented in each column. The points in the top row represent the pellet positions for the utterances with the indicated initial vowel (averaged across five utterances, each with a different final vowel), while the bottom row shows the average for the five utterances with the indicated final vowel. The differences between the effects of V1 (top row) and V2 (bottom row) on schwa can be observed primarily in the systematicity of the relations. The relation between schwa and V2 is quite systematic - for every pellet, the lines do not cross in any of the panels of the bottom row - while for V1 (in the top row), the relationship is only systematic for RY (and somewhat for MY, where there is some crossing, but large effects). Turning now to the "lip" reference points, regression results for these points are found in the left-hand column of table 2.1. The best prediction again involves all three terms, but here, in every case except RX, the best two-term prediction does substantially worse. Thus V1, which had relatively little impact at the "tongue" point, does contribute to the schwa position at this earlier "lip" point. In fact, for these three pellets, the second-best prediction combination always involves V1 (with either V2 or k as the second term). This pattern of results can be confirmed graphically in figure 2.11.
Table 2.1 Regression results for X-ray data
(terms ranked by standard error of the schwa prediction, smallest first; standard error in parentheses)

"Lip" reference point
  MX: k + v1 + v2 (4.6), k + v1 (5.5), v1 + v2 (5.6), k (6.3), v1 (8.6)
  MY: k + v1 + v2 (4.2), k + v1 (5.0), v1 + v2 (7.6), k (10.3), v1 (11.9)
  RX: k + v2 + v1 (3.8), k + v2 (3.9), k (4.8), v1 + v2 (7.0), v1 (11.7)
  RY: k + v2 + v1 (4.1), v2 + v1 (4.9), k + v2 (5.1), k (6.8), v2 (7.0)

"Tongue" reference point
  MX: k + v2 + v1 (4.5), k + v2 (4.6), k (6.2), v2 + v1 (6.4), v2 (10.8)
  MY: k + v1 + v2 (4.7), k + v1 (6.4), k (8.7), v1 + v2 (9.1), v1 (15.1)
  RX: k + v2 + v1 (4.0), k + v2 (4.0), k (5.8), v2 + v1 (7.7), v2 (10.6)
  RY: k + v2 + v1, k + v2, v2 + v1, v2, k (standard errors illegible in the source)
Comparing the V1 effects sketched in the top row of panels in figures 2.10 and 2.11, notice that, although the differences are small, there is more spread between the schwa pellets at the "lip" point (figure 2.11) than at the "tongue" point (figure 2.10). This indicates that the schwa pellet was more affected by V1 at the "lip" point. There is also somewhat less cross-over for MX and RX in the "lip" figure, indicating increased systematicity of the V1 effect. In summary, it appears that the tongue position associated with medial schwa cannot be treated simply as an intermediate point on a direct tongue trajectory from V1 to V2. Instead, there is evidence that this V1-V2 trajectory is warped by an independent schwa component. The importance of this warping can be seen, in particular, in utterances where V1 and V2 are identical (or have identical values on a particular pellet dimension). For example, returning to the utterance /pipə'pipə/ in figure 2.4, we can clearly see (in MX, MY, and RX) that there is definitely movement of the tongue away from the position for /i/ between V1 and V2. This effect is most pronounced for /i/.
Figure 2.10 Relation between full vowel pellet positions and "tongue" pellet positions for schwa. The top row displays the pellet positions for utterances with the indicated initial vowels, averaged across five utterances (each with a different final vowel). The bottom row displays the averaged pellet positions for utterances with the indicated final vowels. Units are X-ray units ( = 0.33 mm)
For example, for MY, the prediction error for the equation without a constant is worse for /pipə'pipə/ than for any other utterance (followed closely by utterances combining /i/ and /u/; MY is very similar for /i/ and /u/). Yet, it may be inappropriate to consider this warping to be the result of a target specific to schwa, since, as we saw earlier, the mean tongue position for schwa is indistinguishable from the mean position of the tongue across all vowels. Rather, the schwa seems to involve a warping of the trajectory toward an overall average or neutral tongue position. Finally, we saw that V1 and V2 affect schwa position differentially at two points in time. The influence of the V1 endpoint is strong and consistent at the "lip" point, relatively early in the schwa, while V2 influence is strong throughout. In the next section, we propose a particular model of gestural structure for these utterances, and show that it can account for the various patterns that we have observed.

2.3 Analysis of simulations

Within the linguistic gestural model of Browman and Goldstein (1990), we expect to be able to model the schwa effects we have observed as resulting from a structure in which there is an active gesture for the medial schwa, but complete temporal overlap of this gesture and the gesture for the following vowel.
Figure 2.11 Relation between full vowel pellet positions and "lip" pellet positions for schwa. The top row displays the pellet positions for utterances with the indicated initial vowels, averaged across five utterances (each with a different final vowel). The bottom row displays the averaged pellet positions for utterances with the indicated final vowels. Units are X-ray units ( = 0.33 mm)
The blending caused by this overlap should yield the V2 effect on schwa, while the V1 effects should emerge as a passive consequence of the differing initial conditions for movements out of different preceding vowels. An example of this type of organization is shown in figure 2.12, which is the gestural score we hypothesized for the utterance /pipə'papə/. As in figure 2.3, each box indicates the activation interval of a particular gestural control regime, that is, an interval of time during which the behavior of the particular tract variable is controlled by a second-order dynamical system with a fixed "target" (equilibrium position), frequency, and damping. The height of the box represents the tract-variable "target." Four LA closure-and-release gestures are shown, corresponding to the four consonants. The closure-and-release components of these gestures are shown as separate boxes, with the closure components having the smaller target for LA, i.e., smaller interlip distance. In addition, four tongue-body gestures are shown, one for each of the vowels - V1, schwa, V2, schwa. Each of these gestures involves simultaneous activation of two tongue-body tract variables, one for constriction location and one for constriction degree. The control regimes for the V1 and medial schwa gestures are contiguous and nonoverlapping, whereas the V2 gesture begins at the same point as the medial schwa and thus completely overlaps it.
Figure 2.12 Gestural score for /pipə'papə/. Tract variable channels displayed, from top to bottom, are: velum, tongue-tip constriction location and constriction degree, tongue-body constriction location and constriction degree, lip aperture, lip protrusion, and glottis. Horizontal extent of each box indicates duration of gestural activation; the shaded boxes indicate activation for schwa. For constriction-degree tract variables (VEL, TTCD, TBCD, LA, GLO), the higher the top of the box, the greater the amount of opening (aperture). The constriction-location tract variables (TTCL, TBCL) are defined in terms of angular position along the curved vocal tract surface. The higher the top of the box, the greater the angle, and the further back and down (towards the pharynx) the constriction
In other words, during the acoustic realization of the schwa (approximately), the schwa and V2 gestural control regimes both control the tongue movements; the schwa relinquishes active control during the following consonant, leaving only the V2 tongue gesture active in the next syllable. While the postulation of an explicit schwa gesture overlapped by V2 was motivated by the particular results of section 2.2, the general layout of gestures in these utterances (their durations and overlap) was based on stiffness and phasing principles embodied in the linguistic model (Browman and Goldstein, 1990). Gestural scores for each of the twenty-five utterances were produced. The activation intervals were identical in all cases; the scores differed only in the TBCL and TBCD target parameters for the different vowels. Targets used for the full vowels were those in our tract-variable dictionary. For the schwa, the target values (for TBCL and TBCD) were calculated as the mean of the targets for the five full vowels.
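The schwa target computation just described amounts to a simple average over the full-vowel dictionary entries. The sketch below shows the calculation with hypothetical target values; the actual dictionary values used in the simulations are not given in the text.

```python
# Hypothetical (TBCL, TBCD) targets for the five full vowels -- illustrative only.
vowel_targets = {
    "i": (95.0, 6.0),
    "e": (100.0, 8.5),
    "a": (130.0, 11.0),
    "ʌ": (120.0, 9.5),
    "u": (85.0, 7.0),
}

def schwa_target(targets):
    """Schwa TBCL/TBCD targets as the mean of the full-vowel targets."""
    tbcl = sum(t[0] for t in targets.values()) / len(targets)
    tbcd = sum(t[1] for t in targets.values()) / len(targets)
    return tbcl, tbcd

print(schwa_target(vowel_targets))   # (106.0, 8.4) for these made-up values
```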
Figure 2.13 Gestural score for /pipə'pipə/. Generated movements (curves) are shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Boxes indicate gestural activation; the shaded boxes indicate activation for schwa. CX is superimposed on TBCL, CY on TBCD, lower lip on LA. Note that the boxes indicate the degree of opening and angular position of the constriction (as described in figure 2.12), rather than the vertical and horizontal displacement of articulators, as shown in the curves
The gestural scores were input to the task-dynamic model (Saltzman 1986), producing motions of the model articulators of the articulatory synthesizer (see figure 2.1). For example, for utterance /pipə'pipə/, figure 2.13 shows the resulting motions (with respect to a fixed reference on the head) of two of the articulators - the center of the tongue-body circle (C), and the lower lip, superimposed on the gestural score. Motion of the tongue is shown in both horizontal and vertical dimensions, while only vertical motion of the lower lip is shown. Note that the lower lip moves up for lip closure (during the regimes with the small LA value). Figure 2.14 shows the results for /pipə'papə/. The articulator motions in the simulations can be compared to those of the data in the previous section (figures 2.4 and 2.5). One difference between the model and the data stems from the fact that the major portion of the tongue dorsum is modeled as an arc of a circle, and therefore all points on this part of the dorsum move together. Thus, it is not possible to model the differential patterns of motion exhibited by the middle (M) and rear (R) of the dorsum in the X-ray data. In general, the motion of CX is qualitatively similar to both MX and RX (which, recall, are highly correlated). For example, both the data and the simulation show a small backward movement for the schwa in /pipə'pipə/; in /pipə'papə/, both show a larger backwards movement for schwa, with the target for /a/ reached shortly thereafter, early in the acoustic realization of V2. The motion of CY in the simulations tends to be similar to that of MY in the data. For example, in /pipə'papə/, CY moves down from /i/ to schwa to /a/, and the target for /a/ tends to be achieved relatively late, compared to CX.
Figure 2.14 Gestural score for /pipə'papə/. Generated movements (curves) are shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Boxes indicate gestural activation; the shaded boxes indicate activation for schwa. Superimposition of boxes and curves as in figure 2.13
Movements corresponding to RY motions are not found in the displacement of the tongue-body circle, but would probably be reflected by a point on the part of the model tongue's surface that is further back than that section lying on the arc of a circle. The model articulator motions were analyzed in the same manner as the X-ray data, once the time points for measurement were determined. Since for the X-ray data we assumed that displacement extrema indicated the effective target for the gesture, we chose the effective targets in the simulated data as the points to measure. Thus, points during V1 and V2 were chosen that corresponded to the point at which the vowel gestures (approximately) reached their targets and were turned off (right-hand edges of the tongue boxes in figures 2.13 and 2.14). For schwa, the "tongue" reference point was chosen at the point where the schwa gesture was turned off, while the "lip" reference was chosen at the lowest point of the lip during schwa (the same criterion as for the X-ray data). The distribution of the model full vowels in the mid-sagittal plane (CX x CY) is shown in figure 2.15. Since the vowel gestures are turned off only after they come very close to their targets, there is very little variation across the ten tokens of each vowel. The distribution of schwa at the "tongue" reference point is shown in figure 2.16, labeled by the identity of V2 (in a) and V1 (in b), with the full vowel ellipses added for comparison. At this reference point that occurs relatively late, the vowels are clustered almost completely by V2, and the tongue center has moved a substantial portion of the way towards the following full vowel.
Figure 2.15 Tongue-center (C) positions for model full vowels, displayed in the mid-sagittal plane with the head facing to the right. The ellipses indicate two standard deviations along axes determined by principal-component analysis. Symbols I = IPA /i/, U = /u/, E = /e/, X = /ʌ/, and A = /a/. Units are ASY units (= 0.09 mm), that is, units in the vocal tract model, measured with respect to the fixed structures.
The distribution of schwa values at the "lip" reference point is shown in figure 2.17, labeled by the identity of V2 (in a), and of V1 (in b). Comparing figure 2.17(a) with figure 2.16(a), we can see that there is considerably more scatter at the "lip" point than at the later "tongue" point. We tested whether the simulations captured the regularities of the X-ray data by running the same set of regression analyses on the simulations as were performed on the X-ray data. The results are shown in table 2.2, which has the same format as the X-ray data results in table 2.1. Similar patterns are found for the simulations as for the data. At the "tongue" reference point, for both CX and CY the best two-term prediction involves the schwa component (constant) and V2, and this prediction is nearly as good as that using all three terms. Recall that this was the case for all pellet dimensions except for MY, whose differences were attributed to differences in the time point at which this dimension was measured. (In the simulations, CX and CY were always measured at the same point in time.) These results can be seen graphically in figure 2.18, where the top row of panels shows the relation between V1 and schwa, and the bottom row shows the relation between V2 and schwa.
Figure 2.16 Tongue-center (C) positions for model schwa at "tongue" reference points, displayed in the right-facing mid-sagittal plane as in figure 2.15. The ellipses are from the model full vowels (figure 2.15), for comparison. Symbols I = IPA /i/, U = /u/, E = /e/, X = /ʌ/, and A = /a/. Units are ASY units (= 0.09 mm). (a) Model schwa positions labeled by the identity of the following vowel (V2). (b) Model schwa positions labeled by the identity of the preceding vowel (V1)
Figure 2.17 Tongue-center (C) positions for model schwa at "lip" reference points, displayed as in figure 2.16 (including ellipses from figure 2.15). Symbols I = IPA /i/, U = /u/, E = /e/, X = /ʌ/, and A = /a/. Units are ASY units (= 0.09 mm). (a) Model schwa positions labeled by the identity of the following vowel (V2). (b) Model schwa positions labeled by the identity of the preceding vowel (V1)
Table 2.2 Regression results of simulations
(terms ranked by standard error of the schwa prediction, smallest first; standard error in parentheses)

"Lip" reference point
  CX: k + v1 + v2 (7.5), v1 + v2 (9.1), k + v1 (20.4), k (29.2), v1 (33.3)
  CY: k + v1 + v2 (4.7), v1 + v2 (13.5), k + v1 (19.7), k (29.1), v1 (41.6)

"Tongue" reference point
  CX: k + v2 + v1 (5.2), k + v2 (5.6), v2 + v1 (13.2), v2 (20.0), k (28.6)
  CY: k + v2 + v1 (3.2), k + v2 (3.9), v2 + v1 (18.8), v2 (28.0), k (30.4)
The same systematic relation between schwa and V2 can be seen in the bottom row as in figure 2.10 for the X-ray data, that is, no crossover. (The lack of systematic relations between V1 and schwa in the X-ray data, indicated by the cross-overs in the top row of figure 2.10, is captured in the simulations in figure 2.18 by the lack of variation for the schwa in the top row.) Thus, the simulations capture the major statistical relation between the schwa and the surrounding full vowels at the "tongue" reference point, although the patterns are more extreme in the simulations than in the data. At the earlier "lip" reference point, the simulations also capture the patterns shown by the data. For both CX and CY, the three-term predictions in table 2.2 show substantially less error than the best two-term prediction. This was also the case for the data in table 2.1 (except for RX), where V1, V2 and a schwa component (constant) all contributed to the prediction of the value during schwa. This can also be seen in the graphs in figure 2.19, which shows a systematic relationship with schwa for both V1 and V2. In summary, for the simulations, just as for the X-ray data, V1 contributed to the pellet position at the "lip" reference, but not to the pellet position at the "tongue" point, while V2 and an independent schwa component contributed at both points. Thus, our hypothesized gestural structure accounts for the major regularities observed in the data (although not for all aspects of the data, such as its noisiness or differential behavior among pellets). The gestural-control regime for V2 begins simultaneously with that for schwa and overlaps it throughout its active interval. This accounts for the fact that V2 and schwa effects can be observed throughout the schwa, as both gestures unfold together.
Figure 2.18 Relation between model full vowel tongue-center positions and tongue-center positions at "tongue" reference point for model schwas. The top row displays the tongue-center positions for utterances with the indicated initial vowels, averaged across five utterances (each with a different final vowel). The bottom row displays the averaged tongue-center positions for utterances with the indicated final vowels. Units are ASY units (= 0.09 mm)
unfold together. However, V1 effects are passive consequences of the initial conditions when the schwa and V2 gestures are "turned on," and thus, their effects disappear as the tongue position is attracted to the "target" (equilibrium position) associated with the schwa and V2 regimes.
2.3.1 Other simulations
While the X-ray data from the subject analyzed here argue against the strongest form of the hypothesis that schwa has no tongue target, we decided nevertheless to perform two sets of simulations incorporating the strong form of an "unspecified" schwa to see exactly where and how they would fail to reproduce the subject's data. In addition, if the synthesized speech were found to be correctly perceived by listeners, it would suggest that this gestural organization is at least a possible one for these utterances, and might be found for some speakers. In the first set of simulations, one of which is
Gesture "Lip" reference
CY
cx 1000
1350
900
1250
1000
1350
900
1250 V2
V2
Figure 2.19 Relation between model full vowel tongue-center positions and tongue-center positions at "lip" reference point for model schwas. The top row displays the tongue-center positions for utterances with the indicated initial vowels, averaged across five utterances (each with a different final vowel). The bottom row displays the averaged tongue-center positions for utterances with the indicated final vowels. Units are ASY units ( = 0.09 mm)
exemplified in figure 2.20, the gestural scores took the same form as in figure 2.12, except that the schwa tongue-body gestures were removed. Thus, active control of V2 began at the end of V1, and, without a schwa gesture, the tongue trajectory moved directly from V1 to V2. During the acoustic interval corresponding to schwa, the tongue moved along this V1-V2 trajectory. The resulting simulations in most cases showed a good qualitative fit to the data, and produced utterances whose medial vowels were perceived as schwas. The problems arose in utterances in which V1 and V2 were the same (particularly when they were high vowels). Figure 2.20 portrays the simulation for /pipə'pipə/: the motion variables generated can be compared with the data in figure 2.4. The "dip" between V1 and V2 was not produced in the simulation, and, in addition, the medial vowel sounded like /i/ rather than schwa. This organization does not, then, seem possible for utterances where both V1 and V2 are high vowels. We investigated the worst utterance (/pipə'pipə/) from the above set of
Figure 2.20 Gestural score plus generated movements for /pip_'pip_/, with no activations for schwa. The acoustic interval between the second and third bilabial gestures is perceived as an /i/. Generated movements (curves) are shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Superimposition of boxes and curves as in figure 2.13
simulations further, generating a shorter acoustic interval for the second vowel (the putative schwa) by decreasing the interval (relative phasing) between the bilabial gestures on either side of it. An example of a score with the bilabial closure gestures closer together is shown in figure 2.21. At relatively short durations as in the figure (roughly < 50 msec), the percept of the second vowel changed from /i/ to schwa. Thus, the completely targetless organization may be workable in cases where the surrounding consonants are only slightly separated. In fact, this suggests a possible historical source for epenthetic schwa vowels that break up heterosyllabic clusters. They could arise from speakers increasing the distance between the cluster consonants slightly, until they no longer overlap. At that point, our simulations suggest that the resulting structure would be perceived as including a schwa-like vowel. The second set of simulations involving an "unspecified" schwa used the same gestural organization as that portrayed in the score in figure 2.20, except that the V2 gesture was delayed so that it did not begin right at the offset of the V1 gesture. Rather, the V2 regime began approximately at the beginning of the third bilabial-closure gesture, as in figure 2.22. Thus, there was an interval of time during which no tongue-body gesture was active, that is, during which there was no active control of the tongue-body tract variables. The motion of the tongue-body center during this interval, then, was determined solely by the neutral positions, relative to the jaw, associated with the tongue-body articulators, and by the motion of the jaw, which was implicated in the ongoing bilabial closure and release gestures. The results,
Figure 2.21 The same gestural score for /pip_'pip_/ as in figure 2.20, but with the second and third bilabial gestures closer together than in figure 2.20. The acoustic interval between the second and third bilabial gestures is perceived as a schwa. Generated movements (curves) are shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Superimposition of boxes and curves as in figure 2.13
displayed in figure 2.22, showed that the previous problem with /pipə'pipə/ was solved, since during the unspecified interval between the two full vowels, the tongue-body lowered (from /i/ position) and produced a perceptible schwa. Unfortunately, this "dip" between V1 and V2 was seen for all combinations of V1 and V2, which was not the case in the X-ray data. For example, this dip can be seen for /papə'papə/ in figure 2.23; in the X-ray data, however, the tongue raised slightly during the schwa, rather than lowering. (The "dip" occurred in all the simulations because the neutral position contributing to the tongue-body movement was that of the tongue-body articulators rather than that of the tongue-body tract variables; consequently the dip was relative to the jaw, which, in turn, was lowering as part of the labial release.) In addition, because the onset for V2 was so late, it would not be possible for V2 to affect the schwa at the "lip" reference point, as was observed in the X-ray data. Thus, this hypothesis also failed to capture important aspects of the data. The best hypothesis remains the one tested first - where schwa has a target of sorts, but is still "colorless," in that its target is the mean of all the vowels, and is completely overlapped by the following vowel.
2.4 Conclusion
We have demonstrated how an explicit gestural model of phonetic structure, embodying the possibilities of underspecification ("targetlessness") and
Figure 2.22 The same gestural score for /pip_'pip_/ as in figure 2.20, but with the onset of the second full vowel /i/ delayed. The acoustic interval between the second and third bilabial gestures is perceived as a schwa. Generated movements (curves) are shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Superimposition of boxes and curves as in figure 2.13
Figure 2.23 The same gestural score as in figure 2.22, except with tongue targets appropriate for the utterance /pap_'pap_/. The acoustic interval between the second and third bilabial gestures is perceived as a schwa. Generated movements (curves) are shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Superimposition of boxes and curves as in figure 2.13
temporal overlap ("coproduction"), can be used to investigate the contextual variation of phonetic units, such as schwa, in speech. For the particular speaker and utterances that we analyzed, there was clearly some warping of the V1-V2 trajectory towards a neutral position for an intervening schwa. The analyses showed that this neutral position has to be defined in the space
of tract variables (the linguistically relevant goal space), rather than being the consequence of neutral positions for individual articulators. Therefore, a target position for schwa was specified, although this target is completely predictable from the rest of the system; it corresponds to the mean tongue tract-variable position for all the full vowels. The temporally overlapping structure of the gestural score played a key role in accounting for the time course of VI and V2 effects on schwa. These effects were well modeled by a gestural score in which active control for schwa was completely overlapped by that for V2. This overlap gave rise to the observed anticipatory effects, while the carry-over effects were passive consequences of the initial conditions of the articulators when schwa and V2 begin. (This fits well with studies that have shown qualitative asymmetries in the nature of carry-over and anticipatory effects [see Recasens 1987].) How well the details of the gestural score will generalize to other speakers and other prosodic contexts remains to be investigated. There is known to be much individual variation in the strength of anticipatory vs. carry-over coarticulation in utterances like those employed here, and also in the effect of stress (Fowler 1981a; Magen 1989). In addition, reduced vowels with different phonological/morphological characteristics, as in the plural (e.g. "roses") and past tense (e.g. "budded") may show different behavior, either with respect to overlap or targetlessness. The kind of modeling developed here provides a way of analyzing the complex quantitative data of articulation so that phonological issues such as these can be addressed.
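The control regime just described can be made concrete with a small numerical sketch. The Python fragment below is not Browman and Goldstein's implementation (their simulations were run with the task-dynamic model and an articulatory synthesizer); it simply integrates a single critically damped tract variable whose target switches from a schwa value to a V2 value, starting from wherever V1 left it. All numbers (targets, stiffness, activation times) are invented for illustration. The point is only the asymmetry discussed above: the carry-over (V1) effect decays passively from the initial conditions, while the schwa and V2 effects reflect active attraction to their targets.

```python
import numpy as np

def tract_variable(v1_pos, schwa_target, v2_target, k=80.0, dt=0.001, dur=0.3):
    """One critically damped tract variable: x'' = -k(x - target) - 2*sqrt(k)*x'.
    The schwa regime is active first and is then superseded by the V2 regime;
    V1 enters only through the initial conditions (carry-over)."""
    n = int(dur / dt)
    x, v = v1_pos, 0.0                      # start where the V1 gesture left the tongue
    track = np.empty(n)
    for i in range(n):
        target = schwa_target if i * dt < 0.15 else v2_target
        a = -k * (x - target) - 2.0 * np.sqrt(k) * v   # critical damping
        v += a * dt
        x += v * dt
        track[i] = x
    return track

# Illustrative tongue-height values only: high V1, mid ("colorless") schwa, low V2
print(tract_variable(1.0, 0.5, 0.0)[::60].round(3))
```

Printing a few samples of the trajectory shows the V1 value being pulled first toward the schwa value and then toward V2, with no residual V1 influence by the time the V2 regime has run its course.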
Comments on Chapter 2
SARAH HAWKINS
The speaker's task is traditionally conceptualized as one of producing successive articulatory or acoustic targets, with the transitions between them being planned as part of the production process.* A major goal of studies of coarticulation is then to identify the factors that allow or prevent coarticulatory spread of features, and so influence whether or not targets are reached. In contrast, Browman and Goldstein offer a model of phonology that is couched in gestural terms, where gestures are abstractions rather than movement trajectories. In their model, coarticulation is the inevitable
*The structure of this discussion is influenced by the fact that it originally formed part of a joint commentary covering this paper and the paper by Hewlett and Shockey. Since the latter's paper was subsequently considerably revised, mention of it has been removed and a separate discussion prepared.
consequence of coproduction of articulatory gestures. Coarticulation is planned only in the sense that the gestural score is planned, and traditional notions of target modification, intertarget smoothing, and look-ahead processes are irrelevant as explanations, although the observed properties they are intended to explain are still, of course, of central concern. Similarly, coarticulation is traditionally seen as a task of balancing constraints imposed by the motoric system and the perceptual system - of balancing ease of articulation with the listener's need for acoustic clarity. These two opposing needs must be balanced within constraints imposed by a third factor, the phonology of the particular language. Work on coarticulation often tries to distinguish these three types of constraint. For me, one of the exciting things about Browman and Goldstein's work is that they are being so successful in linking, as opposed to separating, motoric, perceptual, and phonological constraints. In their approach, the motoric constraints are all accounted for by the characteristics of the taskdynamic model. But the task-dynamic model is much more than an expression of universal biomechanical constraints. Crucially, the task-dynamic model also organizes the coordinative structures. These are flexible, functional groupings of articulators whose organization is not an inevitable process of maturation, but must be learned by every child. Coordinative structures involve universal properties and probably some language-specific properties. Although Browman and Goldstein assign all language-specific information to the gestural score, I suspect that the sort of things that are hard to unlearn, like native accent and perhaps articulatory setting, may be better modeled as part of the coordinative structures within the task dynamics. Thus the phonological constraints reside primarily in the gestural score, but also in its implementation in the task-dynamic model. Browman and Goldstein are less explicitly concerned with modeling perceptual constraints than phonological and motoric ones, but they are, of course, concerned with what the output of their system sounds like. Hence perceptual constraints dictate much of the organization of the gestural score. The limits set on the temporal relationships between components of the gestural score for any given utterance represent in part the perceptual constraints. Variation in temporal overlap of gestures within these limits will affect how the speech sounds. But the amount of variation possible in the gestural score must also be governed by the properties and limits on performance of the parameters in the task-dynamic model, for it is the taskdynamic model that limits the rate at which each gesture can be realized. So the perceptual system and the task-dynamic model can be regarded as in principle imposing limits on possible choices in temporal variation, as represented in the gestural score. (In practice, these limits are determined from measurement of movement data.) Greater overlap will result in greater 57
measurable coarticulation; too little or too much overlap might sound like some dysarthric or hearing-impaired speakers. Browman and Goldstein's work on schwa is a good demonstration of the importance of appropriate temporal alignment of gestures. It also demonstrates the importance to acceptable speech production of getting the right relationships between the gestural targets and their temporal coordination. Thus Browman and Goldstein offer a model in which perception and production, and universal and language-specific aspects of the phonology, are conceptually distinguishable yet interwoven in practice. This, to my way of thinking, is as it should be. The crucial issue in work on coarticulation, however, is not so much to say what constraints affect which processes, as to consider what the controlled variables are. Browman and Goldstein model the most fundamental controlled variables: tongue constriction, lip aperture, velar constriction, and so on. There are likely to be others. Some, like fundamental frequency, are not strongly associated with coarticulation but are basic to phonology and phonetics, and some, like aerodynamic variables, are very complex. Let us consider an example from aerodynamics. Westbury (1983) has shown allophonic differences in voiced stops that depend on position in utterance and that all achieve cavity enlargement to maintain voicing. The details of what happens vary widely and depend upon the place of articulation of the stop, and its phonetic context. For example, for initial /b/, the larynx is lowered, the tongue root moves backwards, and the tongue dorsum and tip both move down. For final /b/, the larynx height does not change, the tongue root moves forward, and the dorsum and tip move slightly upwards. In addition, the rate of cavity enlargement, and the time function, also vary between contexts. Does it make sense to try to include these differences? If the task-dynamic system is primarily universal, then details of the sort Westbury has shown are likely to be in the gestural score. But to include them would make the score very complicated. Do we want that much detail in the phonology, and if so, how should it be included? Browman and Goldstein have elsewhere (1990) suggested a tiered system, and if that solution is pursued, we could lose much of the distinction between phonetics and phonology. While I can see many advantages in losing that distinction, we could, on the other hand, end up with a gestural score of such detail that some of the things phonologists want to do might become undesirably clumsy. The description of phonological alternations is a case in point. So to incorporate these extra details, we will need to consider the structure and function of the gestural score very carefully. This will include consideration of whether the gestural score really is the phonology-phonetics, or whether it is the interface between them. In other words, do we see in the gestural score the phonological primitives, or their output? Browman and 58
Goldstein say it is the former. I believe they are right to stick to their strong hypothesis now, even though it may need to be modified later. Another issue that interests me in Browman and Goldstein's model is variability. As they note, the values of schwa that they produce are much less variable than in real speech. There are a number of ways that variability could be introduced. One, for schwa in particular, is that its target should not be the simple average of all the vowels in the language, as Browman and Goldstein suggest, but rather a weighted average, with higher weighting given to the immediately preceding speech. How long this preceding domain might be I do not know, but its length may depend on the variety of the preceding articulations. Since schwa is schwa basically because it is centralized relative to its context, schwa following a lot of high articulations could be different from schwa in the same immediate context but following a mixture of low and high articulations. A second possibility, not specific to schwa, is to introduce errors. The model will ultimately need a process that generates errors in order to produce real-speech phenomena like spoonerisms. Perhaps the same type of system could produce articulatory slop, although I think this is rather unlikely. If the variability we are seeking for schwa is a type of articulatory slop, it could also be produced by variability in the temporal domain. In Browman and Goldstein's terms, the phase relations between gestures may be less tightly tied together than at present. A fourth possibility is that the targets in the gestural score could be less precisely specified. Some notion of acceptable range might add the desired variability. This idea is like Keating's (1988a) windows, except that her windows determine an articulator trajectory, whereas Browman and Goldstein's targets are realized via the task-dynamic model, which adds its own characteristics. Let me finish by saying that one of the nice things about Browman and Goldstein's work is how much it tells us that we know already. Finding out what we already know is something researchers usually hope to avoid. But in this case we "know" a great number of facts of acoustics, movement, and phonology, but we do not know how they fit together. Browman and Goldstein's observations on intrusive schwa, for example,fitwith my own on children's speech (Hawkins 1984: 345). To provide links between disparate observations seems to me to achieve a degree of insight that we sorely need in this field.
Comments on Chapter 2
JOHN KINGSTON
Introduction
Models are valued more for what they predict, particularly what they predict not to occur, than what they describe. While the capacity of Browman and Goldstein's gestural model to describe articulatory events has been demonstrated in a variety of papers (see Browman et al. 1984; Browman and Goldstein 1985, 1986, 1990; Browman et al. 1986), and there is every reason to hope that it will continue to achieve descriptive success, I am less sanguine about its predictive potential. The foundation of my pessimism is that gestural scores are not thus far constructed in terms of independent principles which would motivate some patterns of gestural occurrence and coordination, while excluding others. "Independent principles" are either such as constrain nonspeech and speech movement alike, or such as arise from the listener's demands on the speaker. That such principles originate outside the narrowly construed events of speaking themselves guards models built on them from being hamstrung by the ad hoc peculiarities of speech movements. The scores' content is constrained by the limited repertoire of gestures used, but because gestures' magnitude may be reduced in casual speech, even to the point of deletion (Browman and Goldstein 1990), the variety of gestures in actual scores is indefinitely large. Further constraints on the interpretation of scores come from the task dynamics, which are governed by principles that constrain other classes of movements (see Kelso et al. 1980; Nelson 1983; Ostry, Keller, and Parush 1983; Saltzman and Kelso 1987). The task dynamics rather than the gestural score also specify which articulatory movements will produce a particular gesture. The gestural score thus represents the model's articulatory goals, while the specific paths to these goals are determined by entirely dynamical means. Gestural coordination is not, however, constrained by the task dynamics and so must be stipulated, and again the number of possible patterns is indefinitely large. Despite this indefiniteness in the content and coordination of scores, examining the articulation of schwa should be informative about what is in a score and how the gestures are coordinated, even if in the end Browman and Goldstein's account does not extend beyond description. The next two sections of this commentary examine Browman and Goldstein's claim that English schwa has an articulation of its own. This examination is based on an extension of their statistical analysis and leads to a partial rejection of their claim. In the final section, the distinction between predictive vs. descriptive models is taken up again. 60
Table 2.3 Variances for lip and tongue reference positions

           MX        MY        RX        RY
Lip        910.2     2290.7    364.4     1111.2
Tongue     558.6     959.0     325.9     843.4
Does schwa have its own target?
Browman and Goldstein found that the positions of two tongue pellets (MX-MY and RX-RY) in a schwa between flanking full vowels closely match the grand mean of pellet positions in the full vowels, implying that during schwa the tongue simply has whatever position is dictated by the transition between the full vowels that flank it. Therefore, when flanking vowels are identical, the tongue should not deviate from the full vowel positions during schwa. However, Browman and Goldstein's data show the tongue does move away from these positions and back again during the schwa, implying it does have its own target. Giving schwa its own target is supported by the stepwise regression in which including a constant factor representing effects independent of either of the flanking vowels yielded a smaller residual variance. Schwa's target only looks transitional because it is very close to the grand mean of the tongue positions of all the full vowels. The stepwise regression also revealed that the tongue position during schwa was determined more by V2 than V1, perhaps because V2 was more prominent than V1. Influences on the tongue position for schwa were assessed by comparing standard errors for multiple-regression models containing different combinations of terms for V1, V2, and k, the independent schwa factor. "Standard error" is the standard error of estimate (SE), a measure of the residual variance not accounted for by the terms in the regression models. The standard error of estimate is the term on the right, the square root of the residual mean square, in (1) (Cohen and Cohen 1983: 104);
(1)   SE = √[ Σ(Y − Ȳ)²(1 − R²) / (n − q − 1) ]   (q is the number of terms in the regression model)

which shows that SE's magnitude is not only a function of the proportion of variance not accounted for, 1 − R², but also of the overall magnitude of variance in the dependent measure, Σ(Y − Ȳ)². Since the magnitude of this latter variance will differ
Table 2.4 Shrunken R²s for lip reference positions
MX MY RX RY
k + V1 + V2
k + V1
0.879 0.957 0.750 0.912
0.848 0.945
k + V2
V1+V2
k
V1
0.818 0.882 0.654 0.839
0.752 0.864
0.731 0.885
0.846 0.917 0.517 0.880
V1
V2
0.157 0.834
Table 2.5 Shrunken R²s for tongue reference positions
MX MY RX RY
k + V1 + V2
0.806 0.882 0.631 0.872
k + V1 0.832
k + V2
V1+V2
k
0.793
0.712 0.761 0.406 0.842
0.709 0.761 0.533 0.778
0.631 0.863
0.586
V2 0.491 0.145 0.778
between dependent variables, the absolute magnitude of the SEs for models with different dependent variables cannot be compared. Accordingly, to evaluate how well the various regression models fare across the pellet positions, a measure of variance should be used that is independent of the effect of different variances among the dependent variables, i.e. R², rather than SE. More to the point, the R²s can be employed in significance tests of differences between models of the same dependent variable with different numbers of terms. The equation in (1) can be solved for R², but only if one knows Σ(Y − Ȳ)² (solving this equation for R² shows that the values listed by Browman and Goldstein must be the squared standard error of estimate). This variance was obtained from Browman and Goldstein's figures, which plot the four tongue coordinates; measurements were to the nearest division along each axis, and their precision is thus ±1 mm for MY and RY, ±0.55 mm for RX, and ±0.625 mm for MX (measurement error in either direction is roughly equal to half a division for each of the pellet coordinates). The variances obtained do differ substantially for the four measures (see table 2.3), with the variances for vertical position consistently larger than for horizontal position at both reference points. The resulting shrunken R²s for the various regression models at lip and tongue reference positions are shown in tables 2.4 and 2.5 (the gaps in these tables are for regression models not considered by the stepwise procedure). Shrunken R²s are given in these tables because they are a better estimate of the proportion of variance
accounted for in the population from which the sample is taken when the ratio of independent variables q to n is large, as here, and when independent variables are selected post hoc, as in the stepwise regression. (Shrunken R²s were calculated according to formula (3.6.4) in Cohen and Cohen (1983: 106-7), in which q was always the total number of independent variables from which selections were made by the stepwise procedure, i.e. 3.) The various models were tested for whether adding a term to the equation significantly increased the variance, assuming Model I error (see Cohen and Cohen 1983: 145-7). Comparisons were made of k + V1 + V2 with k + V1 or k + V2 and with V1 + V2. The resulting F-statistics confirmed Browman and Goldstein's contention that adding V1 to the k + V2 model does not increment the variance significantly at MX, RX, or RY at the tongue reference positions (for k + V1 + V2 vs. k + V2: MX F(2,19) = 0.637, p > 0.05; RX F(2,19) = 0, p > 0.05; and RY F(2,19) = 0.668, p > 0.05) and also supports their observation that both V1 and V2 increment R² substantially for MY at the tongue reference positions (for k + V1 + V2 vs. k + V1, F(2,19) = 4.025, p < 0.05). However, their claim that for the lip reference position, the two-term models k + V1 or k + V2 account for substantially less variance than the three-term model k + V1 + V2 is not supported, for any dependent variable (for k + V1 + V2 vs. k + V1: MX F(2,19) = 2.434, p > 0.05 and MY F(2,19) = 2.651, p > 0.05, and for k + V1 + V2 vs. k + V2: RX F(2,19) = 0.722, p > 0.05 and RY F(2,19) = 2.915, p > 0.05). Comparisons of the other two-term model, V1 + V2, with k + V1 + V2 yielded significant differences for MY (F(2,19) = 8.837, p < 0.01) and RX (F(2,19) = 8.854, p < 0.01), but for neither dependent variable was V1 + V2 the second-best model. At MX and RY, the differences in amount of variance accounted for by V1 + V2 vs. k + V1 (MX) or k + V2 (RY) are very small (less than half of 1 percent in each case), so choosing the second-best model is impossible. In any case, there is no significant increment in the variance in the three-term model, k + V1 + V2, with respect to the V1 + V2 two-term model at MX (F(2,19) = 2.591, p > 0.05) or RY (F(2,19) = 2.483, p > 0.05). Thus at the lip reference position, schwa does not coarticulate strongly with V2 at MX or MY, nor does it coarticulate strongly with V1 at RX or RY. There is evidence for an independent schwa target at MY and RX, but not MX or RY. Use of R²s rather than SEs to evaluate the regression models has thus weakened Browman and Goldstein's claims regarding both schwa's having a target of its own and the extent to which it is coproduced with flanking vowels.
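For readers who want to reproduce this kind of comparison, the arithmetic is compact. The Python sketch below is not Kingston's actual computation; it only illustrates the three steps described above: recovering R² from a squared standard error of estimate via formula (1), applying the Cohen and Cohen shrinkage correction, and forming an F ratio for a full versus a reduced model. The sample size n = 23 is an assumption read off the reported degrees of freedom (2, 19) together with q = 3, and the R² values plugged in at the bottom are invented, not taken from tables 2.1-2.5.

```python
def r2_from_se2(se_squared, ss_total, n, q):
    """Invert formula (1): SE^2 = ss_total * (1 - R^2) / (n - q - 1)."""
    return 1.0 - se_squared * (n - q - 1) / ss_total

def shrunken_r2(r2, n, q):
    """Adjusted ('shrunken') R^2, Cohen and Cohen (1983) formula 3.6.4."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - q - 1)

def f_increment(r2_full, r2_reduced, n, q_full, df_num=2):
    """F ratio for the variance added by the full model over the reduced one."""
    return ((r2_full - r2_reduced) / df_num) / ((1.0 - r2_full) / (n - q_full - 1))

n, q = 23, 3                        # assumed; gives the F(2, 19) degrees of freedom
r2_full, r2_two_term = 0.90, 0.86   # illustrative values only
print(round(shrunken_r2(r2_full, n, q), 3))
print(round(f_increment(r2_full, r2_two_term, n, q), 3))
```

With values of this size, the increment test comes out near the conventional 0.05 criterion, which is why small differences in the second-best model matter for the argument in the text.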
Are all schwas the same?
Whether schwa has an independent target depends on how much the schwa factor contributes to R² when the tokens with identical flanking vowels are omitted. If schwa's target is the grand mean of all the full vowel articulations, then the schwa factor should contribute substantially less with this omission, since the tongue should pass through that position on its path between differing but not identical full vowels. Schwas may be made in more than one way, however: between unlike vowels, schwa may simply be a transitional segment, but between like vowels, a return to a more neutral position might have to be achieved, either by passive recoil if schwas are analogous to the "trough" observed between segments which require some active articulatory gesture (Gay 1977, 1978; cf. Boyce 1986) or by means of an active gesture as Browman and Goldstein argue. (Given the critical damping of gestures in the task dynamics, one might expect passive recoil to achieve the desired result, but then why is a specified target needed for schwa?) On the other hand, if the schwa factor contributes nearly the same amount to R² in regression models where the identical vowel tokens are set aside, then there is much less need for multiple mechanisms. Finally, one may ask whether schwas are articulated with the same gesture when there is no flanking full vowel, on one side or the other, as in the first schwa of Pamela or the second schwa in Tatamagouchi. A need more fundamental than looking at novel instances of the phenomenon is for principles external to the phenomena on which the modeling is based, which would explain why one gestural score is employed and not others. I point to where such external, explanatory principles may be found in the next section of this commentary.
Description vs. explanation
The difficulty I have with the tests of their gestural model that Browman and Goldstein present is that they stop when an adequate descriptive fit to the observed articulatory trajectories was obtained. Lacking is a theory which would predict the particular gestural scores that most closely matched the observed articulations, on general principles. (The lack of predictive capacity is, unfortunately, not a problem unique to Browman and Goldstein's model; for example, we now have excellent descriptive accounts of how downtrends in Fo are achieved (Pierrehumbert 1980; Liberman and Pierrehumbert 1984; Pierrehumbert and Beckman 1988), but still very little idea of why downtrends are achieved with the mechanisms identified, why these mechanisms are employed in all languages with downtrends, or even why downtrends are so ubiquitous.) If V2's gesture overlaps more with the preceding schwa than V1's because it is more prominent, why should prominence have this effect on
gestural overlap? On the other hand, if more overlap is always observed between the schwa and the following full vowel, why should anticipatory coarticulation be more extensive than carry-over? And in this case, what does greater anticipatory coarticulation indicate about the relationship between the organization of gestures and the trochaic structure of stress feet in English? All of these are questions that we might expect an explanatory or predictive theory of gestural coordination to answer. The gestural theory developed by Browman and Goldstein may have all the pieces needed to construct a machine that will produce speech, indeed, it is already able to produce particular speech events, but as yet there is no general structure into which these pieces may be put which would produce just those kinds of speech events that do occur and none of those that do not. Browman and Goldstein's gestural theory is not incapable of incorporating general principles which would predict just those patterns of coordination that occur; the nature of such principles is hinted at by Kelso, Saltzman, and Tuller's (1986a) replication of Stetson's (1951) demonstration of a shift from a VC to CV pattern of articulatory coordination as rate increased. Kelso, Saltzman, and Tuller suggest that the shift reflects the greater stability of CV over VC coordination, but it could just as well be that place and perhaps other properties of consonants are more reliably perceived in the transition from C to V than from V to C (see Ohala 1990 and the references cited there, as well as Kingston 1990 for a different view). If this latter explanation is correct, then the search for the principles underlying the composition of gestural scores must look beyond the facts of articulation, to examine the effect the speaker is trying to convey to the listener and in turn what articulatory liberties the listener allows the speaker (see Lindblom 1983, Diehl and Kluender 1989, and Kingston and Diehl forthcoming for more discussion of this point).
Comments on Chapter 2
WILLIAM BARRY
In connection with Browman and Goldstein's conclusion that schwa is "weak but not completely targetless," I should like to suggest that they reach it because their concept of schwa is not totally coherent with the model within which the phenomenon "neutral vowel" is being examined. The two "nontarget" simulations that are described represent two definitions:
1 A slot in the temporal structure which is empty with regard to vowel quality, the vowel quality being determined completely by the preceding and
following vowel targets. This conflicts, in spirit at least, with the basic concept of a task-dynamic system, which explicitly evokes the physiologically based "coordinative structures" of motor control (Browman and Goldstein 1986). A phonologically targetless schwa could still not escape the residual dynamic forces of the articulatory muscular system, i.e. it would be subject to the relaxation forces of that system.
2 A relaxation target. The relaxation of the tongue-height parameter in the second simulation is an implicit recognition of the objection raised in point 1, but it still clashes with the "coordinative" assumption of articulatory control, which argues against one gesture being relaxed independent of other relevant gestural vowel parameters.
If an overall "relaxation target" is accepted, then, from a myofunctional perspective there is no means of distinguishing the hypothesized "targetless" schwa from the schwa-target as defined in the paper. Any muscle in a functional system can only be accorded a "neutral" or "relaxation" value as a function of the forces brought to bear on it by other muscles within the system. These forces will differ with each functional system. The rest position for quiet respiration (velum lowered, lips together, mandible slightly lowered, tongue tip on alveolar ridge) is different from the preparatory position found prior to any speech act independent of the character of the utterance onset (lips slightly apart, velum raised, jaw slightly open, laryngeal adduction). The relaxation position may, therefore, be seen as a product of the muscular tensions required by any functional system, and implicit support for this view is given by Browman and Goldstein's finding that the mean back and front tongue height for schwa is almost identical with the mean tongue heights for all the other vowels. In other words, the mean vowel specifying the schwa "target" used by Browman and Goldstein is identical with the relaxation position of the vocalic functional system since it reflects the balance of forces between the muscle-tension targets specified for all the other vowels within the system. This accords nicely with the accepted differences in neutral vowel found between languages, and allows a substantive definition of the concept of "basis of articulation" which has been recognized qualitatively for so long (Franke 1889; Sievers 1901; Jespersen 1904, 1920; Roudet 1910). This implies that, phonologically, schwa can in fact be regarded as undefined or "targetless," a status in keeping with its optional realization in many cases before sonorant consonants, and its lack of function when produced epenthetically. One difficult question is the physiological definition and delimitation of a functional system, as it is mainly the scientific area of inquiry and the level of descriptive delicacy which defines a function. Since the same muscles are used
for many different functions, a total physiological independence of one functional system from another using the same muscles cannot be expected. A critical differentiation within speech, for example, is between possible vocalic vs. consonantal functional subsystems. It has long been postulated as descriptively convenient and physiologically supportable that consonantal gestures are superimposed on an underlying vocalic base (Ohman 1966a; Perkell 1969; Hardcastle 1976). Browman and Goldstein's gestural score is certainly in accordance with this view. A resolution of the problem within the present discussion is not necessary, however, since the bilabial consonantal context is maximally independent of the vocalic system and is kept constant.
3 Prosodic structure and tempo in a sonority model of articulatory dynamics MARY BECKMAN, JAN EDWARDS, and JANET FLETCHER
3.1 Introduction
One of the most difficult facts about speech to model is that it unfolds in time.* The phonological structure of an utterance can be represented in terms of a timeless organization of categorical properties and entities phonemes in sequence, syllables grouped into stress feet, and the like. But a phonetic representation must account for the realization of such structures as physical events. It must be able to describe, and ultimately to predict, the time course of the articulators moving and the spectrum changing. Early studies in acoustic phonetics demonstrated a plethora of influences on speech timing, with seemingly complex interactions (e.g. Klatt 1976). The measured acoustic durations of segments were shown to differ widely under variation in overall tempo, in the specification of adjacent segments, in stress placement or accentuation, in position relative to phrase boundaries, and so on. Moreover, the articulatory kinematics implicated in any one linguistic specification - tempo or stress, say - showed a complicated variation across speakers and conditions (e.g. Gay 1981). The application of a general model of limb movement (task dynamics) shows promise of resolving this variation by relating the durational correlates of tempo and stress to the control of dynamic parameters such as gestural stiffness and amplitude (e.g. Kelso et al. 1985; Ostry and Munhall 1985). However, the mapping between these *Haskins Laboratories generously allowed us to use their optoelectronic tracking system to record the jaw-movement data. These recordings were made and processed with the assistance of Keith Johnson and Kenneth De Jong. Madalyn Ortiz made the measurements of gesture durations, displacements, and velocities, and Maria Swora made the Fo tracks and supplied one transcription of the intonation patterns. The work reported in this paper was supported by the National Science Foundation under grant number IRI-8617873 to Jan Edwards and grants IRI861752 and IRI-8858109 to Mary Beckman. Support was also provided by the Ohio State University in various ways, including a Postdoctoral Fellowship to Janet Fletcher. 68
parameters and the underlying phonological specification of prosodic structure is not yet understood. A comparison of the articulatory dynamics associated with several different lengthening effects suggests an approach to this mapping. This paper explores how the general task-dynamic model can be applied to the durational correlates of accent, of intonation-phrase boundaries, and of slower overall speaking tempo. It contrasts the descriptions of these three different effects in a corpus of articulatory measurements of jaw movement patterns in [pap] sequences. We will begin by giving an overview of the task-dynamic model before presenting the data, and conclude by describing what the data suggest concerning the nature of timing control. We will propose that, underlying the more general physical representation of gestural stiffness and amplitude, there must be an abstract language-specific phonetic representation of the time course of sonority at the segmental and various prosodic levels.
3.2 The task-dynamic model
In the late 1970s, a group of speech scientists at Haskins Laboratories proposed that speech production can be described using a task-dynamic model originally developed to account for such things as the coordination of flexor and extensor muscles in the control of gait (see, e.g., Fowler et al. 1980). In the decade since, safer techniques for recording articulator movements have been refined, allowing large-scale studies in which multiple repetitions of two or three syllable types can be examined for more informative comparisons among different linguistic and paralinguistic influences on durational structure (e.g. Ostry, Keller, and Parush 1983; Kelso et al. 1985; Ostry and Munhall 1985). These studies showed patterns in the relationships among kinematic measures in speech gestures that are similar to those displayed by limb movements in walking, reaching, and the like, lending plausibility to the proposed application to speech of the taskdynamic model. More recently, the application has been made feasible by the development of a mathematics capable of describing crowded gestures in sequence (see Saltzman 1986; Saltzman and Munhall 1989) and by the development of a system for representing segmental features directly in terms of task-dynamic specifications of articulatory gestures (see Browman and Goldstein 1986, 1988, 1990, this volume). A fundamental assumption in this application of task dynamics is that speech can be described as an orchestration of abstract gestures that specifies (within a gesture) stiffness and displacement and (between gestures) relative phase (see Hawkins, this volume, for an overview). To interpret articulatory kinematics in terms of this model, then, we must look at relationships among movement velocity, displacement, and duration for indications of the under69
lying dynamic specifications of intragestural amplitude and stiffness and of intergestural phase. (Note that we use the term displacement for the observed kinematic measure, reserving amplitude for the underlying dynamic specification in the model.) Consider first how the model would deal with linguistic contrasts that essentially change gestural amplitude - for example, featural specifications for different degrees of vowel constriction. If the model is correct, the observed velocities of relevant articulators should correlate positively with the observed displacements. Such a relationship is seen in Ostry and Munhall's (1985) data on tongue-dorsum movement in /kV/ sequences. In this corpus, the size of the opening gesture into the vowel varied from a small dorso-velar displacement for /ku/ to a large displacement for /ka/. At the same time, movement velocity also varied from slow to fast, and was strongly correlated with displacement, indicating a relatively constant dorsal stiffness over the different vowel qualities. The task-dynamic model provides an intuitively satisfying explanation for the variation in gestural velocity. Without the reference to the underlying dynamic structure, the inverse relationship between vowel height and velocity is counterintuitive at best. Consider next how the model would deal with a contrast that presumably does not specify different amplitudes, such as overall tempo variation for the same vowel height. Here we would predict a somewhat different relationship between observed velocity and displacement. In Kelso et al.'s (1985) study of lower lip gestures in /ba/ sequences, regression functions for peak velocity against displacement generally had higher slopes for fast as compared to slow tokens. As the authors point out, such differences in slope can be interpreted as indicating different stiffnesses for the gestures at the different tempi. The gesture is faster at fast tempo because it is stiffer. Because it is stiffer without being primarily larger in amplitude, it should also be shorter, a prediction which accords also with other studies. Finally, consider how phase specifications can affect the observed kinematics of a gesture. For example, if an articulator is specifically involved in the oral gestures for a stop and both adjacent vowels, as in the tonguedorsum raising and lowering gestures of an /aka/ sequence, then undershoot of the stop closure might occur if the opening gesture is phased very early relative to the closing gesture, resulting in an apparent replacement of /k/ with [x]. Browman and Goldstein (1990) have proposed that this sort of variation in the phasing of gestures accounts for many consonant lenitions and even deletions in casual speech. In cases of less extreme overlap, the early phasing of a following gesture might obscure a gesture's underlying amplitude specification without effecting a perceptible difference in segmental quality. Bullock and Grossberg (1988) have suggested that such truncation is the rule in the typically dense succession of gestures in fluent speech. Results 70
Figure 3.1 Predicted relationships among kinematic measures for sequences of gestures with (b)-(d) differing stiffnesses, (e)-(g) differing displacements, and (h)-(j) differing intergestural phasings
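A reader can check these predicted relationships with a few lines of code. The sketch below is not from the paper; it evaluates the closed-form trajectory of a critically damped gesture started from rest toward a target and reports peak velocity, movement duration, and the displacement/velocity ratio, with arbitrary stiffness and amplitude values. Doubling stiffness raises peak velocity and shortens duration at a constant displacement, while doubling amplitude raises peak velocity and displacement together at a roughly constant duration, which is the pattern summarized in the figure.

```python
import numpy as np

def gesture_kinematics(k, D, dt=1e-4, span=2.0):
    """Critically damped approach to a target from rest over distance D with stiffness k:
    remaining distance = D*(1 + w*t)*exp(-w*t) and speed = D*w**2*t*exp(-w*t), w = sqrt(k)."""
    w = np.sqrt(k)
    t = np.arange(0.0, span, dt)
    remaining = D * (1.0 + w * t) * np.exp(-w * t)
    speed = D * w ** 2 * t * np.exp(-w * t)
    duration = t[np.argmax(remaining < 0.1 * D)]   # time to cover 90% of the distance
    return speed.max(), duration

for k, D in [(100.0, 10.0), (400.0, 10.0), (100.0, 20.0)]:   # stiffer vs. larger gestures
    peak_v, dur = gesture_kinematics(k, D)
    print(f"k={k:5.0f}  displacement={D:4.1f}  peak velocity={peak_v:6.1f}  "
          f"duration={dur:.3f}  displacement/velocity={D / peak_v:.3f}")
```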
by Nittrouer et al. (1988) suggest that varying degrees of truncation associated with differences in intrasyllabic phase relationships may underlie minor effects on vowel quality of lexical stress contrasts. Figure 3.1 summarizes the kinematic patterns that should result from varying each of the three different dynamic specifications. In a pure stiffness change, peak velocity would change but observed displacement should remain unchanged, as shown in figure 3.1c. Duration should be inversely proportional to the velocity change - smaller velocities going with longer 71
Table 3.1 Test sentences 1a 1b
Obligatory intonation phrase break: Pop, opposing the question strongly, refused to answer it. Poppa, posing the question loudly, refused to answer it.
2a 2b
No phrase break likely: Pop opposed the question strongly, and so refused to answer it. Poppa posed the question loudly, and then refused to answer it.
durations. In this case, the ratio of the displacement to the peak velocity should be a good linear predictor of the observed duration (fig. 3.Id). In a pure amplitude change as well, peak velocity should change, but here observed displacement should also change, in constant proportion to the velocity change (fig. 3.If). In accordance with the constant displacementvelocity ratio, the observed duration is predicted to be fairly constant (fig. 3.1g). Finally, in a phase change, peak velocity and displacement might remain relatively unchanged (fig. 3.1i), but the observed duration would change as the following gesture is phased earlier or later; it would be shorter or longer than predicted by the displacement-velocity ratio (fig. 3.1j). If the following gesture is phased early enough, the effective displacement might also be measureably smaller for the same peak velocity ("truncated" tokens in figs. 3.1i and 3.1j). 3.3 Methods
In our experiment, we measured the kinematics of movements into and out of a low vowel between two labial stops. These [pap] sequences occurred in the words pop versus poppa in two different sentence types, shown in table 3.1. In the first type, the target word is set off as a separate intonation phrase, and necessarily bears a nuclear accent. In the other type, the noun is likely to be part of a longer intonation phrase with nuclear accent falling later. We had four subjects read these sentences at least five times in random order at each of three self-selected tempi. We used an optoelectronic tracking system (Kay et al. 1985) to record jaw height during these productions. We looked at jaw height rather than, say, lower-lip height because of the jaw's contribution both to overall openness in the vowel and to the labial closing gesture for the adjacent stops. We defined jaw opening and closing gestures as intervals between moments of zero velocity, as shown infigure3.2, and we measured their durations, displacements, and peak velocities. We also made 72
3 M. Beckman, J. Edwards, and J. Fletcher
Figure 3.2 Sample jaw height and velocity traces showing segmentation points for vowel opening gesture and consonant closing gesture in Poppa
Fo tracks of the sentences, and had two observers independently transcribe the intonation patterns. Figure 3.3 shows sample Fo tracks of utterances of sentences la and 2a (table 3.1). The utterances show the expected phrasings, with an intonationphrase boundary after thzpop infigure3.3a but no intonational break of any kind after the pop infigures3.3b and 3.3c. All four subjects produced these two contrasting patterns of phrasing for all of the tokens of the contrasting types. For sentences la and lb, this phrasing allows only one accentuation, a nuclear accent on the pop or poppa, as illustrated by subject KDJ's production in figure 3.3a. For sentences of type 2, however, the phrasing is consistent with several different accent patterns. Infigure3.3b, for example, there is a prenuclear accent on pop, whereas in figure 3.3c, the first accent does not occur until the following verb, making pop completely unaccented. Subjects KAJ and CDJ produced a variety of the possible accentuations for this phrasing. Subjects JRE and KDJ, on the other hand, never produced a prenuclear accent on the target word. They consistently put the first pitch accent later, as in JRE's utterance in figure 3.3c. For these two subjects, therefore, we can use the unaccented first syllable of poppa posed and the nuclear-accented syllable of poppa, posing to examine the durational correlates of accent. 3.4 The kinematics of accent
Figure 3.4 shows the mean durations, displacements, and peak velocities of the opening and closing gestures for this accent contrast for subject JRE, averaged by tempo. Examining the panels in thefigurerow by row, we see 73
Gesture Hz
(a) KAJ
180
r
140
LL
100
H% I
opposing the question strongly,
- Pop,
•
1
.
.
,
,
,
•
•
•
,
2 ,
|
(b)
H*+L
180
^
-
KAJ H* + L
140
\
H*+L
H*
L 100
Pop opposed the question strongly i
i
.
.
.
.
.
1 .
i
.
i
, 2 ,
(c) JRE
280
H+L*
\ 220
140
H%
Pop opposed
the question strongly, i
i
i
i
i
"z ,
Time (sec.) Figure 3.3 Sample Fo contours for target portions of sentences la and 2a (table 3.1) produced by subjects KAJ and JRE
first that the gestures are longer in the nuclear-accented syllable in poppa, posing. Note also the distribution of the durational increase; at all three tempi, it affects both the opening and closing gesture of the syllable, although it affects the closing gesture somewhat more. In the next row we see that the gestures are larger in the accented syllable. Again, the increase in the kinematic measure affects both the opening and the closing gesture; both move about 2 mm further. Finally, in the last row we see the effect of accent on the last kinematic measure. Here, by contrast, there is no consistent 74
3 M. Beckman, J. Edwards, and J. Fletcher o — o Accented,
. 300
Opening gesture
• — • Unaccented
Subject JRE
Closing gesture
300-
fast
normal
slow
fast
normal
slow
Figure 3.4 Mean durations, displacements, and peak velocities of opening and closing gestures for nuclear-accented syllables (in Poppa, posing) vs. unaccented syllables (in Poppa posed) for subject JRE
pattern. The opening gesture is faster in the accented syllable, but the closing gesture is clearly not. The overall pattern of means agrees with Summers's (1987) results for accented and unaccented monosyllabic nonsense words. Interpreting this pattern in terms of intragestural dynamics alone, we would be forced to conclude that accent is not realized as a uniform change in a single specification for the syllable as a whole. A uniform increase in the amplitude specification for the accented syllable would be consistent with the greater 75
Gesture accented O, unaccented # 550
150 0.100
0.150
0.200
0.250
0.300
Predicted syllable duration (sec.) = Idisplacement/velocity (mm/[mm/sec]) Figure 3.5 Observed syllable durations against predicted syllable durations for the accented vs. unaccented syllables in figure 3.4
displacements of both gestures and with the greater velocity of the opening gesture, but not with the velocity of the closing gesture. A decrease in stiffness for the accented syllable could explain the increased durations of the two gestures, but it must be just enough to offset the velocity increase caused by the increased displacement of the closing gesture. If we turn to the intergestural dynamics, however, we can explain both the displacement and the length differences in terms of a single specification change: a different phasing for the closing gesture relative to the opening gesture. That is, the opening gesture is longer in the accented syllable because its gradual approach towards the asymptotic target displacement is not interrupted until later by the onset of the closing gesture. Consequently, its effective displacement is larger because it is not truncated before reaching its target value in the vowel. The closing gesture, similarly, is longer because the measured duration includes a quasi-steady-state portion where its rapid initial rise is blended together with the gradual fall of the opening gesture's asymptotic tail. And its displacement is larger because it starts at a greater distance from its targeted endpoint in the following consonant. Figure 3.5 shows some positive evidence in favor of this interpretation. The value plotted along the y-axis in this figure is the observed overall duration of each syllable token, calculated by adding the durations of the opening and closing gestures. The value along the x-axis is a relative measure 76
3 M. Beckman, J. Edwards, and J. Fletcher
of the predicted duration, calculated by adding the displacement-velocity ratios. Recalling the relationships described above in figure 3.1, we expect that as long as the phasing of the closing gesture relative to the opening gesture is held constant, the observed durations of the gestures should be in constant proportion to their displacement-velocity ratios. Therefore, if the closing gesture's phase is the same between the accented and unaccented syllables in figure 3.5, all of the tokens should lie along the same regression line, with the longer nuclear-accented syllables lying generally further to the upper right. As the figure shows, however, the relationship between measured and predicted durations is different for the two accent conditions. For any given predicted duration, the measured duration of an accented syllable is larger than that for an unaccented syllable. This difference supports our interpretation of the means. An accented syllable is longer not because its opening and closing gestures are less stiff, but because its closing gesture is substantially later relative to its opening gesture; the accented syllable is bigger, in the sense that the vocal tract is maximally open for a longer period of time. We note further that this horizontal increase in size is accompanied by an effective vertical increase; the jaw moves further in the accented syllable because the opening gesture is not truncated by the closing gesture. Since the sound pressure at the lips is a function not just of the source pressure at the glottis but also of the general openness of the vocal tract, an important consequence of the different jaw dynamics is that the total acoustic intensity should be substantially larger in the accented syllable, a prediction which accords with observations in earlier acoustic studies (see Beckman 1986 for a review). 3.5 The kinematics of final lengthening
3.5 The kinematics of final lengthening
These kinematic patterns for the accentual contrast differ substantially from those for the effect of phrasal position, as can be seen by comparing figure 3.4 with figure 3.6. This figure plots the durations, displacements, and peak velocities of opening and closing gestures in the intonation-phrase-final [pap] of pop, opposing vs. the nonfinal [pap] of poppa, posing for the same subject. Looking first at the mean durations, we see that, compared to the durational increase associated with accent, the greater length in intonation-phrase-final position is not distributed so evenly over the syllable. It affects the opening gesture considerably less and the closing gesture substantially more. The patterns for the mean displacements also are different. Unlike the greater length of a nuclear-accented syllable, intonation-phrase-final lengthening is not accompanied by any significant difference in articulator displacement. At all three tempi, the jaw moves about the same distance for final as for nonfinal syllables. The peak velocities for these gestures further illuminate these
Figure 3.6 Mean durations, displacements, and peak velocities of opening and closing gestures for phrase-final syllables (in Pop, opposing) versus nonfinal syllables (in Poppa, posing) for subject JRE
differences. The opening gesture of a final syllable is as fast as that of a nonfinal syllable, as would be expected, given their similar durations and displacements. However, the closing gestures are much slower for the longer phrase-final syllables. The extreme difference in the closing gesture velocities unaccompanied by any difference in displacement suggests a change in articulator stiffness; the phrase-final gesture is longer because it is less stiff. In sum, by contrast to the lengthening associated with accent, final
Figure 3.7 Observed syllable durations against predicted syllable durations (sec.; predicted = Σ displacement/velocity, mm/[mm/sec]) for the final vs. nonfinal syllables in figure 3.6
lengthening makes a syllable primarily slower rather than bigger. That is, phrase-final syllables are longer not because their closing gestures are phased later, but rather because they are less stiff. In terms of the underlying dynamics, then, we might describe final lengthening as an actual targeted slowing down, localized to the last gesture at the edge of a phrase. Applying the reasoning that we used to interpret figure 3.5 above, this description predicts that the relationship between observed duration and predicted duration should be the same for final and nonfinal closing gestures. That is, in figure 3.7, which plots each syllable's duration against the sum of its two displacement-velocity ratios, the phrase-final tokens should be part of the same general trend, differing from nonfinal tokens only in lying further towards the upper-right corner of the graph. However, the figure does not show this predicted pattern. While the fast tempo and normal tempo tokens are similar for the two phrasal positions, the five slow-tempo phrase-final tokens are much longer than predicted by their displacement-velocity ratios, making the regression curve steeper and pulling it away from the curve for nonfinal tokens. The meaning of this unexpected pattern becomes clear when we compare our description of final lengthening as a local slowing down to the overall slowing down of tempo change.
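The distinction drawn here between the two available controls, stiffness and phasing, can be illustrated with a small simulation. This is only a sketch under our own assumptions (a critically damped second-order gesture started from rest, with invented parameter values), not the authors' task-dynamic implementation; it shows that lowering the closing gesture's stiffness lengthens that movement without changing displacement, whereas delaying its onset lengthens the opening movement and enlarges its effective displacement.

    import numpy as np

    def gesture(x0, target, omega, t):
        # Critically damped second-order movement from rest at x0 toward target:
        # x(t) = target + (x0 - target) * (1 + omega*t) * exp(-omega*t)
        return target + (x0 - target) * (1.0 + omega * t) * np.exp(-omega * t)

    def syllable(open_target, omega_open, omega_close, switch_time,
                 close_target=0.0, dt=0.001, total=0.6):
        # Opening gesture toward open_target; at switch_time the system is reset
        # toward close_target (the phasing of the closing gesture).
        t = np.arange(0.0, total, dt)
        x_switch = gesture(0.0, open_target, omega_open, switch_time)
        x = np.where(t < switch_time,
                     gesture(0.0, open_target, omega_open, t),
                     gesture(x_switch, close_target, omega_close, t - switch_time))
        i_peak = int(x.argmax())                 # end of the opening movement
        peak = x[i_peak]                         # effective opening displacement
        settled = np.nonzero(x[i_peak:] <= close_target + 0.05 * (peak - close_target))[0]
        closing_dur = (settled[0] if settled.size else x[i_peak:].size - 1) * dt
        return peak, t[i_peak], closing_dur

    # Same stiffness, later phasing ("accent-like"): longer opening movement and
    # larger effective displacement.  Lower closing stiffness, same phasing
    # ("final-like"): longer closing movement, displacement unchanged.
    print(syllable(12.0, 40.0, 40.0, switch_time=0.10))
    print(syllable(12.0, 40.0, 40.0, switch_time=0.16))
    print(syllable(12.0, 40.0, 20.0, switch_time=0.10))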
Figure 3.8 Mean syllable duration for fast, normal, and slow tempi productions of the final and nonfinal syllables in figure 3.6
3.6 The kinematics of slow tempo
Our first motivation for varying tempo in this experiment was to provide a range of durations for each prosodic condition to be able to look at relationships among the different kinematic measurements. We also wanted to determine whether there is any upper limit on lengthening due to, say, a lower limit on gestural stiffness. We paid particular attention, therefore, to the effects of tempo variation in the prosodic condition that gives us the longest durational values: the nuclear-accented phrase-final syllables of pop, opposing.
Figure 3.8 shows the mean overall syllable duration of this phrase-final [pap] for the same subject shown above in figures 3.4-3.7. For comparison to tempo change in shorter syllables, the figure also shows means for the nonfinal [pap] of poppa, posing. As acoustic studies would predict, there is a highly significant effect of tempo on the overall duration of the syllable for both phrasal positions. This is true for slowing down as well as for speeding up. Speeding up tempo resulted in shorter mean syllable durations in both phrasal positions. Conversely, going from normal to slow tempo resulted in longer mean syllable durations. Note that these tempo effects are symmetrical; slowing down tempo increases the syllable's duration as much as speeding up tempo reduces its duration. Figure 3.9 shows the mean durations, displacements, and peak velocities for the opening and closing gestures in these syllables. (These are the same
Figure 3.9 Mean durations, displacements, and peak velocities of opening and closing gestures for fast, normal, and slow tempi productions of final and nonfinal syllables shown previously in figure 3.8
data as in fig. 3.6, replotted to emphasize the effect of tempo.) For the opening gesture, shown in the left-hand column, slowing down tempo resulted in longer movement durations and lower peak velocities, unaccompanied by any change in movement size. Conversely, speeding up tempo resulted in overall shorter movement durations and higher peak velocities, again unaccompanied by any general increase in displacement. (The small increase in displacement for the nonfinal syllables is statistically significant, but very small when compared to the substantial increase in velocity.) These
patterns suggest that, for JRE, the primary control parameter in tempo variation is gestural stiffness. She slows down tempo by decreasing stiffness to make the gestures slower and longer, whereas she speeds up tempo by increasing stiffness to make the gestures faster and shorter. This general pattern was true for the opening gestures of the three other subjects as well, although it is obscured somewhat by their differing abilities to produce three distinct rates. The closing gestures shown in the right-hand column of figure 3.9, by contrast, do not show the same consistent symmetry between speeding up and slowing down. In speeding up tempo, the closing gestures pattern like the opening gestures; at fast tempo the gesture has shorter movement durations and higher peak velocities for both the final and nonfinal syllables. In slowing down tempo, however, there was an increase in movement duration and a substantial decrease in movement velocity only for syllables in nonfinal position; phrase-final closing gestures were neither longer nor significantly slower at slow than at normal overall tempo. Subject CDJ showed this same asymmetry. The asymmetry can be summarized in either of the following two ways: subjects JRE and CDJ had little or no difference between normal and slow tempo durations and velocities for final syllables; or, these two subjects had little or no difference between final and nonfinal closing gesture durations and velocities at slow tempo. It is particularly noteworthy that, despite this lack of any difference in closing gesture duration, the contrast in overall syllable duration was preserved, as was shown above for JRE in figure 3.8. It is particularly telling that these two subjects had generally longer syllable durations than did either subject KAJ or KDJ. We interpret this lack of contrast in closing gesture duration for JRE and CDJ as indicating some sort of lower limit on movement velocity or stiffness. That this limit is not reflected in the overall syllable duration, on the other hand, suggests that the subjects use some other mechanism - here probably a later phasing of the closing gesture - to preserve the prosodic contrast in the face of a limit on its usual dynamic specification. This different treatment of slow-tempo final syllables would explain the steeper regression curve slope in figure 3.7 above. Suppose that for fast and normal tempi, the nonfinal and final syllables have the same phasing, and that the difference in observed overall syllable duration results from the closing gestures being longer because they are less stiff in the phrase-final position. For these two tempi, the relationship between observed duration and predicted duration would be the same. At slow tempo, on the other hand, the phrase-final gestures may have reached the lower limit on gestural stiffness. Perhaps this is a physiological limit on gestural speed, or perhaps the gesture cannot be slowed any further without jeopardizing the identity of the [p] as a voiceless stop. In
order to preserve the durational correlates of the prosodic contrast, however, the closing gesture is phased later, making the observed period of the syllable longer relative to its predicted period, and thus preserving the prosodic contrast in the face of this apparent physiological or segmental limit.
3.7 The linguistic model
That the prosodic contrast is preserved by substituting a different dynamic specification suggests that the final lengthening has an invariant specification at some level of description above the gestural dynamics. In other words, our description of final lengthening as a local tempo change is accurate at the level of articulatory dynamics only to a first approximation, and we must look for an invariant specification of the effect at a yet more abstract level of representation. Should this more abstract level be equated with the phonological structures that represent categorical contrast and organization? We think not. Although final lengthening is associated with phonologically distinct phrasings, these distinctions are already represented categorically by the hierarchical structures that describe the prosodic organization of syllables, feet, and other phonological units. A direct phonological representation of the phrase-final lengthening would be redundant to this independently necessary prosodic structure. Moreover, we would like the abstract representation of final lengthening to motivate the differences between it and accentual lengthening and the similarities between it and slowing down tempo overall. A discrete phonological representation of lengthening at phrase edges, such as that provided by grid representations (e.g. Liberman 1975; Selkirk 1984), would be intractable for capturing these differences and similarities. A more promising approach is to try to describe the quantitative properties of the lengthenings associated with nuclear accent, phrase-final position, and overall tempo decrease in terms of some abstract phonetic representation that can mediate between the prosodic hierarchy and the gestural dynamics. We propose that the relevant level of description involves the time course of a substantive feature "sonority." We chose sonority as the relevant phonetic feature because of its role in defining the syllable and in relating segmental features to prosodic structures. Cross-linguistically, unmarked syllables are ones whose associated segments provide clear sonority peaks. The syllable is also a necessary unit for describing stress patterns and their phonological properties in time, including the alignment of pitch accents. The syllable is essential for examining the phonetic marks of larger intonational phrases, since in the prosodic hierarchy these units are necessarily coterminous with syllables.
Our understanding of sonority owes much to earlier work on tone scaling (Liberman and Pierrehumbert 1984; Pierrehumbert and Beckman 1988). We see the inherent phonological sonority of a segment as something analogous to the phonological specification of a tone, except that the intrinsic scale is derived from the manner features of a segment, as proposed by Clements (1990a): stops have L (Low) sonority and open vowels have H (High) sonority. These categorical values are realized within a quantitative space that reflects both prosodic structure and paralinguistic properties such as overall emphasis and tempo. That is, just as prosodic effects on F0 are specified by scaling H and L tones within an overall pitch range, prosodic effects on phonetic sonority are specified by scaling segmental sonority values within a sonority space. This sonority space has two dimensions. One dimension is Silverman and Pierrehumbert's (1990) notion of sonority: the impedance of the vocal tract looking forward from the glottis. We index this dimension by overall vocal tract openness, which can be estimated by jaw height in our target [pap] sequences. The other dimension of the sonority space is time; a vertical increase in overall vocal-tract openness is necessarily coupled to a horizontal increase in the temporal extent of the vertical specification. In this two-dimensional space, prosodic constituents all can be represented as rectangles having some value for height and width, as shown in figure 3.10b for the words poppa and pop. The phonological properties of a constituent will help to determine these values. For example, in the word poppa, the greater prosodic strength of the stressed first syllable as compared to the unstressed second syllable is realized by the larger sizes of the outer box for this syllable and of the inner box for its bimoraic nucleus. (Figure 3.10a shows the phonological representation for these constituents, using the moraic analysis of heavy syllables first proposed by Hyman 1985.) We understand the lengthening associated with accent, then, as part of an increase in overall sonority for the accented syllable's nucleus. The larger mean displacements of accented gestures reflect the vertical increase, and the later phasing represents the coupled horizontal increase, as in figure 3.10c. This sort of durational increase is analogous to an increase in local tonal prominence for a nuclear pitch accent within the overall pitch range. Final lengthening and slowing down tempi are fundamentally different from this lengthening associated with accent, in that neither of these effects is underlyingly a sonority change. Instead, both of these are specified as increases in box width uncoupled to any change in box height; they are strictly horizontal increases that pull the sides of a syllable away from its centre, as shown in figure 3.10d and e. The two strictly horizontal effects differ from each other in their locales. Slowing down tempo overall is a more global effect that stretches out a
Figure 3.10 (a) Prosodic representations and (b) sonority specifications for the words pop and poppa. Effects on sonority specification of (c) accentual lengthening, (d) final lengthening, and (e) slowing down tempo. (f) Prosodic representation and sonority representation of palm
syllable on both sides of its moraic center. It is analogous in tone scaling to a global change of pitch range for a phrase. Final lengthening, by contrast, is local to the phrase edge. It is analogous in tone scaling to final lowering. In proposing that these structures in a sonority-time space mediate between the prosodic hierarchy and the gestural dynamics, we do not mean to imply that they represent a stage of processing in a hypothetical derivational sequence. Rather, we understand these structures as a picture of the rhythmic framework for interpreting the dynamics of the segmental gestures associated to a prosodic unit. For example, in our target [pap] sequences, accentual lengthening primarily affected the phasing of the closing gesture into the following consonant. We interpret this pattern as an increase in the sonority of the accented syllable's moraic nucleus. Since the following [p] is associated to the following syllable, the larger sonority decreases its overlap with the vowel gesture. If the accented syllable were [pam] (palm), on the other hand, we would not predict quite so late a phasing for the [m] gesture, since the syllable-final [m] is associated to the second mora node in the prosodic tree, as shown in figure 3.10f. This understanding of the sonority space also supports the interpretation of F0 alignment patterns. For example, Steele (1987) found that the relative position of the F0 peak within the measured acoustic duration of a nuclear-accented syllable remains constant under durational increases due to overall tempo change. This is just what we would expect in our model of the sonority-time space if the nuclear accent is aligned to the sonority peak for the syllable nucleus. We would model this lengthening as a stretching of the syllable to both sides of the moraic center. In phrase-final position, on the other hand, Steele found that the F0 peak comes relatively earlier in the vowel. Again, this is just what we would predict from our representation of final lengthening as a stretching that is local to the phrase edge. This more abstract representation of the rhythmic underpinnings of articulatory dynamics thus allows us to understand the alignment between the F0 pattern and the segmental gestures. In further work, we hope to extend this understanding to other segmental sequences and other prosodic patterns in English. We also hope to build on Vatikiotis-Bateson's (1988) excellent pioneering work in cross-linguistic studies of gestural dynamics, to assess the generality of our conclusions to analogous prosodic structures in other languages and to rhythmic structures that do not exist in English, such as phonemic length distinctions.
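As a purely illustrative rendering of this proposal (our own toy coding, not the authors' formalization; class and method names and scaling conventions are invented), a constituent's box can be given a temporal extent and a height, with the three lengthening effects realized as different operations on it:

    class SonorityBox:
        """A constituent in the two-dimensional sonority space: a temporal
        extent (left, right) and a height standing for overall vocal-tract
        openness."""

        def __init__(self, left, right, height):
            self.left, self.right, self.height = left, right, height

        def accent(self, scale):
            # Accentual lengthening: a vertical (sonority) increase coupled
            # with a horizontal (temporal) increase; here the right edge is
            # pushed out in proportion to the height increase.
            self.height *= scale
            self.right += (scale - 1.0) * (self.right - self.left)

        def final_lengthen(self, extra):
            # Final lengthening: a strictly horizontal increase, local to the
            # right (phrase-final) edge; height is unchanged.
            self.right += extra

        def slow_tempo(self, scale, center):
            # Overall tempo slowing: a strictly horizontal increase that
            # stretches the box on both sides of its moraic center.
            self.left = center - scale * (center - self.left)
            self.right = center + scale * (self.right - center)

    # e.g. nucleus = SonorityBox(0.0, 0.15, 1.0); nucleus.accent(1.3)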
Comments on chapter 3
OSAMU FUJIMURA
The paper by Beckman, Edwards, and Fletcher has two basic points: (1) the sonority contour defines temporal organization; and (2) mandible height is assumed to serve as a measure of sonority. In order to relate mandible movement to temporal patterns, the authors propose to use the task-dynamics model. They reason as follows. Since task dynamics, by adopting a given system time constant ("stiffness," see below), defines a fixed relation between articulatory movement excursion ("amplitude") and the duration of each movement, measuring the relation between the two quantities should test the validity of the model and reveal the role of adjusting control variables of the model for different phonetic functions. For accented vs. unaccented syllables, observed durations deviate from the prediction when a fixed condition for switching from one movement to the next is assumed, while under phrase-final situations, data conform with the prediction by assuming a longer time constant of the system itself. (Actually, the accented syllable does indicate some time elongation in the closing gesture as well.) Based on such observations, the authors suggest that there are two different mechanisms of temporal control: (1) "stiffness," which in this model (as in Browman and Goldstein 1986, 1990) means the system time constant; and (2) "phase," which is the timing of resetting of the system for a new target position. This resetting is triggered by the movement across a preset threshold position value, which is specified in terms of a percentage of the total excursion. The choice between these available means of temporal modulation depends on phonological functions of the control (amplitude should not interact with duration in this linear model). This is a plausible conclusion, and it is very interesting. There are some other observations that can be compared with this; for example, Macchi (1988) demonstrated that different articulators (the lower lip vs. the mandible) carry differentially segmental and suprasegmental functions in lip-closing gestures. I have some concern about the implication of this work with respect to the model used. Note that the basic principle of oscillation in task dynamics most naturally suggests that the rest position of the hypothetical spring-inertia system is actually the neutral position of the articulatory mechanism. In a sustained repetition of opening and closing movements for the same syllable, for example, this would result in a periodic oscillatory motion which is claimed to reflect the inherent nature of biological systems. In the current model, however (as in the model proposed by Browman and Goldstein [1986, 1990]), the rest position of the mass, which represents the asymptote of a critically damped system, is not the articulatory-neutral position but the target position of the terminal gesture of each (demisyllabic) movement.
Thus the target position must be respecified for each new movement. This requires some principle that determines the point in the excursion at which target resetting should take place. For example, if the system were a simple undamped oscillatory system consisting of a mass and a spring, it could be argued that one opening gesture is succeeded immediately by a closing gesture after the completion of the opening movement (i.e. one quarter cycle of oscillation); this model would result in a durational property absolutely determined by the system characteristics, i.e. stiffness of the spring (since inertia is assumed to equal unity), for any amplitude of the oscillation. In Beckman, Edwards, and Fletcher's model, presumably because of the need to manipulate the excursion for each syllable, a critically damped second-order (mass-spring) system is assumed. This makes it absolutely necessary to control the timing of resetting. However, this makes the assertion of task dynamics - that the biological oscillatory system dictates the temporal patterning of speech - somewhat irrelevant. Beckman, Edwards, and Fletcher's model amounts to assuming a critically damped second-order linear system as an impulse response of the system. This is a generally useful mathematical approximation for implementing each demisyllabic movement on the basis of a command. The phonetic control provides specifications of timing and initial (or equivalently target) position, and modifies the system time constant. The duration of each (upgoing or downgoing) mandibular movement is actually measured as the interval between the two extrema at the end of each (demisyllabic) movement. Since the model does not directly provide such smooth endpoints, but rather a discontinuous switching from an incomplete excursion towards the target position to the next movement (presumably starting with zero velocity), there has to be some ad hoc interpretation of the observed smooth function relative to the theoretical break-point representing the system resetting. Avoiding this difficulty, this study measures peak velocity, excursion, and duration for each movement. Peak velocity is measurable relatively accurately, assuming that the measurement does not suffer from excessive noise. The interpretation of excursion, i.e. the observed displacement, as the distance between the initial position and the terminating position (at the onset of the next movement) is problematic (according to the model) because the latter cannot be related to the former unless a specific method is provided of deriving the smooth time function to be observed. A related problem is that the estimation of duration is not accurate. Specifically, Beckman, Edwards, and Fletcher's method of evaluating endpoints is not straightforward for two reasons. (1) Measuring the time value for either endpoint at an extremum is inherently inaccurate, due to the nature of extrema. Slight noise and small bumps, etc. affect the time value considerably. In particular, an error seems to transfer a portion of the
opening duration to the next closing duration according to Beckman, Edwards, and Fletcher's algorithm. The use of the time derivative zero-crossing is algorithmically simple, but the inherent difficulty is not resolved. (2) Such measured durations cannot be compared accurately with the predictions of the theory, as discussed above. Therefore, while the data may be useful on their own merit, they cannot evaluate the validity of the model assumed. If the aim is to use specific properties of task dynamics and determine which of its particular specifications are useful for speech analyses, then one should match the entire time function by curve fitting, and forget about the end regions (and with them amplitude and duration, which depend too much on arbitrary assumptions). In doing so, one would probably face hard decisions about the specific damping condition of the model. More importantly, the criteria for newly identifying the rest position of the system at each excursion would have to be examined. The finding that phrase-final phenomena are different from accent or utterance-speed control is in conformity with previous ideas. The commonly used term "phrase-final (or preboundary) elongation" (Lehiste 1980) implies qualitatively and intuitively a time-scale expansion. The value of Beckman, Edwards, and Fletcher's work should be in the quantitative characterization of the way this modification is done in time. The same can be said about the conclusion that in phrase-final position, the initial and final parts of the syllable behave differently. One interesting question is whether such an alteration of the system constant, i.e. the time scale, is given continuously towards the end, or uniformly for the last phonological unit, word, foot, syllable, or demisyllable, in phrase-final position. Beckman, Edwards, and Fletcher suggest that if it is the latter, it may be smaller than a syllable, but a definitive conclusion awaits further studies. My own approach is different: iceberg measurement (e.g. Fujimura 1986) uses the consonantal gesture of the critical articulator, not the mandible, for each demisyllable. It depends on the fast-moving portions giving rather accurate assessment of timing. Some of the results show purely empirically determined phrasing patterns of articulatory movements, and uniform incremental compression/stretching of the local utterance speed over the pertinent phrase as a whole, depending on prominence as well as phrasing control (Fujimura 1987). I think temporal modulation in phrasal utterances is a crucial issue for phonology and phonetics. I hope that the authors will improve their techniques and provide definitive data on temporal control, and at the same time prove or refute the validity of the task-dynamics model.
4 Lenition of /h/ and glottal stop
JANET PIERREHUMBERT AND DAVID TALKIN
4.1 Introduction
In this paper we examine the effect of prosodic structure on how segments are pronounced. The segments selected for study are /h/ and glottal stop /?/. These segments permit us to concentrate on allophony in source characteristics. Although variation in oral gestures may be more studied, source variation is an extremely pervasive aspect of obstruent allophony. As is well known, /t/ is aspirated syllable-initially, glottalized when syllable-final and unreleased, and voiced throughout when flapped in an intervocalic falling stress position; the other unvoiced stops also have aspirated and glottalized variants. The weak voiced fricatives range phonetically from essentially sonorant approximants to voiceless stops. The strong voiced fricatives exhibit extensive variation in voicing, becoming completely devoiced at the end of an intonation phrase. Studying /h/ and /?/ provides an opportunity to investigate the structure of such source variation without the phonetic complications related to presence of an oral closure or constriction. We hope that techniques will be developed for studying source variation in the presence of such complications, so that in time a fully general picture emerges. Extensive studies of intonation have shown that phonetic realization rules for the tones making up the intonation pattern (that is, rules which model what we do as we pronounce the tones) refer to many different levels of prosodic structure. Even for the same speaker, the same tone can correspond to many different F0 values, depending on its prosodic environment, and a given F0 value can correspond to different tones in different prosodic environments (see Bruce 1977; Pierrehumbert 1980; Liberman and Pierrehumbert 1984; Pierrehumbert and Beckman 1988). This study was motivated by informal observations that at least some aspects of segmental allophony
Figure 4.1 Wide-band spectrogram and waveform of the word hibachi produced with contrastive emphasis. Note the evident aspiration and the movement in F1 due to the spread glottis during the /h/. The hand-marked segment locators and word boundaries are indicated in the lower panel: m is the /m/ release; v marks the vowel centers; h the /h/ center; b the closure onset of the /b/ consonant. The subject is DT
behave in much the same way. That is, we suspected that tone has no special privilege to interact with prosody; phonetic realization rules in general can be sensitive to prosodic structure. This point is illustrated in the spectrograms and waveforms of figures 4.1 and 4.2. In figure 4.1 the word hibachi carries contrastive stress and is well articulated. In figure 4.2, it is in postnuclear position and the /h/ is extremely lenited; that is, it is produced more like a vowel than the /h/ in figure 4.1. A similar effect of sentence stress on /h/ articulation in Swedish is reported in Gobl (1988). Like the experiments which led to our present understanding of tonal realization, the work reported here considers the phonetic outcome for particular phonological elements as their position relative to local and nonlocal prosodic features is varied. Specifically, the materials varied position relative to the word prosody (the location of the word boundary and the word stress) and relative to the phrasal prosody (the location of the phrase boundary and the phrasal stress as reflected in the accentuation). Although there is also a strong potential for intonation to affect segmental source characteristics (since the larynx is the primary articulator for tones), this issue is not substantially addressed in the present study because the difficulties
Figure 4.2 Wide-band spectrogram and waveform of the word hibachi in postnuclear position. Aspiration and F1 movement during /h/ are less than in figure 4.1. The subject is DT
of phonetic characterization for /h/ and /?/ led us to an experimental design with Low tones on all target regions. Pierrehumbert (1989) and a study in progress by Silverman, Pierrehumbert, and Talkin do address the effects of intonation on source characteristics directly by examining vocalic regions, where the phonetic characterization is less problematic. The results of the experiment support a parallel treatment of segmental source characteristics and tone by demonstrating that the production of laryngeal consonants depends strongly on both word- and phrase-level prosody. Given that the laryngeal consonants are phonetically similar to tones by virtue of being produced with the same articulator, one might ask whether this parallel has a narrow phonetic basis. Studies such as Beckman, Edwards, and Fletcher (this volume), which reports prosodic effects on jaw movement, indicate that prosody is not especially related to laryngeal articulations, but can affect the extent and timing of other articulatory gestures as well. We would suggest that prosody (unlike intonational and segmental specifications) does not single out any particular articulator, but instead concerns the overall organization of articulation. A certain tradition in phonology and phonetics groups prosody and intonation on the one hand as against segments on the other. Insofar as segments behave like tones, the grouping is called into question. We would
like instead to argue for a point of view which contrasts structure (the prosodic pattern) with content (the substantive demands made on the articulators by the various tones and segments). The structure is represented by the metrical tree, and the content by the various autosegmental tiers and their decomposition into distinctive features. This point of view follows from recent work in metrical and autosegmental phonology, and is explicitly put forward in Pierrehumbert and Beckman (1988). However, many of its ramifications remain to be worked out. Further studies are needed to clarify issues such as the degree of abstractness of surface phonological representations, the roles of qualitative and quantitative rules in describing allophony, and the phonetic content of distinctive features in continuous speech. We hope that the present study makes a contribution towards this research program.
4.2 Background
4.2.1 /h/ and glottal stop
Both /h/ and glottal stop /?/ are produced by a laryngeal gesture. They make no demands on the vocal-tract configuration, which is therefore determined by the adjacent segments. They are both less sonorous than vowels, because both involve a gesture which reduces the strength of voicing. For /h/, the folds are abducted. /?/ is commonly thought to be produced by adduction (pressing the folds together), as is described in Ladefoged (1982), but examination of inverse-filtering results and electroglottographic (EGG) data raised doubts about the generality of this characterization. We suggest that a braced configuration of the folds produces irregular voicing even when the folds are not pressed together (see further discussion below).
4.2.2 Source characterization
The following broad characteristics of the source are crucial to our characterization. (1) For vowels, the main excitation during each pitch period occurs at the point of contact of the vocal folds, because the discontinuity in the glottal flow injects energy into the vocal tract which can effectively excite the formants (see Fant 1959). This excitation point is immediately followed by the "closed phase" of the glottal cycle during which the formants have their most stable values and narrowest bandwidths. The "open phase" begins when the vocal folds begin to open. During this phase, acoustic interaction at the glottis results in greater damping of the formants as well as shifts in their location. (2) "Softening" of vocal-fold closure and an increase in the open quotient is associated with the "breathy" phonation in /h/. The abduction
gesture (or gesture of spreading the vocal folds) associated with this type of phonation brings about an increase in the frequencies and bandwidths of the formants, especially F1; an increase in the overall rate of spectral roll-off; an additional abrupt drop in the magnitude of the second and higher harmonics of the excitation spectrum; and an increase in the random noise component of the source, especially during the last half of the open phase. For some speakers, a breathy or soft voice quality is found during the postnuclear region of declaratives, as a reflex of phrasal intonation. (3) A "pressed" or "braced" glottal configuration is used to produce /?/. This is realized acoustically as period-to-period irregularities in the timing and spectral content of the glottal excitation pulses. A full glottal stop (with complete obstruction of airflow at the glottis) is quite unusual. Some speakers use glottalized voicing, rather than breathy voicing, during the postnuclear region of declaratives.
4.2.3 Prosody and intonation
We assume that the word and phrase-level prosody is represented by a hierarchical structure along the lines proposed by Selkirk (1984), Nespor and Vogel (1986), and Pierrehumbert and Beckman (1988) (see Ladd, this volume). The structure represents how elements are grouped phonologically, and what relationships in strength obtain among elements within a given grouping. Details of these theories will not be important here, provided that the representation makes available to the phonetic realization rules all needed information about strength and grouping. Substantive elements, both tones and segments, are taken to be autosegmentally linked to nodes in the prosodic tree. The tones and segments are taken to occur on separate tiers, and in this sense have a parallel relationship to the prosodic structure (see Broe, this volume). In this study, the main focus is on the relationship of the segments to the prosodic structure. The relationship of the tones to the prosodic structure enters into the study indirectly, as a result of the fact that prosodic strength controls the location of pitch accents in English. In each phrase, the last (or nuclear) pitch accent falls on the strongest stress in the phrase, and the prenuclear accents fall on the strongest of the preceding stresses. For this reason, accentuation can be used as an index of phrasal stress, and we will use the word "accented" to mean "having sufficient phrasal stress to receive an accent." "Deaccented" will mean "having insufficient phrasal stress to receive an accent"; in the present study, all deaccented words examined are postnuclear. Rules for pronouncing the elements on any autosegmental tier can reference the prosodic context by examining the position and properties of the node the segment is linked to. In particular, studies of fundamental
frequency lead us to look for sensitivity to boundaries (Is the segment at a boundary or not? If so, what type of boundary?) and to the strength of the nodes above the segment. Pronunciation rules are also sensitive to the substantive context. For example, in both Japanese and English, downstep or catathesis applies only when the tonal sequence contains particular tones. Similarly, /h/ has a less vocalic pronunciation in a consonantal environment than in a vocalic one. Such effects, widely reported in the literature on coarticulation and assimilation, are not investigated here. Instead, we control the segmental context in order to concentrate on the less well understood prosodic effects. Although separate autosegmental tiers are phonologically independent, there is a strong potential for phonetic interaction between tiers in the case examined here, since both tones and laryngeal consonants make demands on the laryngeal configuration. This potential interaction was not investigated, since our main concern was the influence of prosodic structure on segmental allophony. Instead, intonation was carefully controlled to facilitate the interpretation of the acoustic signal.
4.3 Experimental methods
4.3.1 Guiding considerations
The speech materials and algorithms for phonetic characterization were designed together in order to achieve maximally interpretable results. Source studies such as Gobl (1988) usually rely on inverse filtering, a procedure in which the effects of vocal-tract resonances are estimated and removed from the signal. The residue is an estimate of the derivative of the flow through the glottis. This procedure is problematic for both /?/ and /h/. For /?/, it is difficult to establish the pitch periods to which inverse filtering should be applied. (Inverse filtering carried out on arbitrary intervals of the signal can have serious windowing artifacts). Inverse filtering of /h/ is problematic because of its large open quotient. This can introduce subglottal zeroes, rendering the all-pole model of the standard procedure inappropriate, and it can increase the frequency and bandwidth of the first formant to a point where its location is not evident. The unknown contribution of noise also makes it difficult to evaluate the spectral fit to the periodic component of the source. These considerations led us to design materials and algorithms which would allow us to identify differences in source characteristics without first estimating the transfer function. Specific considerations guiding the design of the materials were: (1) F1 is greater than three times F0. This minimizes the effects of F1 bandwidth and location on the low-frequency region, allowing it to reflect source characteristics
in a more straightforward fashion. (2) Articulator movement in the upper vocal tract is minimal during target segments. (3) The consonants under study are produced by glottal gestures in an open-vowel environment to facilitate interpretation of changes to the vocal source.
4.3.2 Materials
In the materials for the experiment, the position of /h/ and /?/ relative to word-level and phrase-level prosodic structure is varied. We lay out the full experimental design here, although we will only have space to discuss some subsets of the data which showed particularly striking patterns. In the materials /h/ is exemplified word-initially and -medially, before both vowels with main word stress and vowels with less stress:
mahogany
tomahawk
hibachi
hogfarmers
Omaha
hawkweed
The original intention was to distinguish between a secondary stress in tomahawk and an unstressed syllable at the end of Omaha, but it did not prove possible to make this distinction in the actual productions, so these cases will be treated together as "reduced" /h/s. Intervocalic /?/ occurs only word-initially. (/?/ as an allophone of /t/ is found before syllabic nasals, as in "button," but not before vowels.) So, for /?/ we have the following sets of words, providing near minimal comparisons to the word-initial /h/s:
August
awkwardness
abundance
augmentation
Augustus
This set of words was generated using computerized searches of several online dictionaries. The segmental context of the target consonant was designed to have a high first formant value and minimize formant motion, in order to simplify the acoustic phonetic analysis. The presence of a nasal in the vicinity of the target consonant is undesirable, because it can introduce zeroes which complicate the evaluation of source characteristics. This suboptimal choice was dictated by the scarcity of English words with medial /h/, even in the educated vocabulary. We felt that it was critical to use real words rather than nonsense words, in order to have accurate control over the word-level prosody and in order to avoid exaggerated pronunciations. Words in which the target consonant was word-initial were provided with a /ma/ context on the left by appropriate choice of the preceding word. Phrases such as the following were used:
Oklahoma August
lima abundance figures
plasma augmentation
The position of the target words in the phrasal prosody was also manipulated. The phrasal positions examined were (1) accented without special focus, (2) accented and under contrast, (3) accented immediately following an intonational phrase boundary, and (4) postnuclear. In order to maximize the separation of F0 and F1, the intonation patterns selected to exemplify these positions all had L tones (leading to low F0) at the target location. This allows the source influences on the low-frequency region to be seen more clearly. The intonation patterns were also designed to display a level (rather than time-varying) intonational influence on F0, again with a view to minimizing artifacts. The accented condition used a low-rising question pattern (L* H H% in the transcription of Pierrehumbert [1980]):
(1)
Is he an Oklahoma hogfarmer?
Accent with contrast was created by embedding the "contradiction" pattern (L* L H%) in the following dialogue: (2)
A: Is it mahogany? B: No, it's rosewood. C: It's mahogany!
In the "phrase boundary" condition, a preposed vocative was followed by a list, as in the following example: (3)
Now Emma, August is hot here, July is hot here, and even June is hot here.
The vocative had a H* L pattern (that is, it had a falling intonation and was followed by an intermediate phrase boundary rather than a full intonation break). Non-final list items had a L* H pattern, while the final list item (which was not examined) had a H* L L% pattern. The juncture of the H* L vocative pattern with the L* H pattern of the first list item resulted in a low, level F0 configuration at the target consonant. Subjects were instructed to produce the sentences without a pause after the vocative, and succeeded in all but a few utterances (which were eliminated from the analysis). No productions lacked the desired intonational boundary. In the "postnuclear" condition, the target word followed a word under contrast:
(4)
They're Alabama hogfarmers, not Oklahoma hogfarmers.
In this construction, the second occurrence of the target word was the one analyzed.
4.3.3 Recording procedures
Since pilot studies showed that subjects had difficulty producing the correct patterns in a fully randomized order, a blocked randomization was used. Each block consisted of twelve sentences with the same phrasal intonation pattern; the order within each block was randomized. The four blocks were then randomized within each set. Six sets were recorded. The first set was discarded, since it included a number of dysfluent productions for both speakers. The speech was recorded in an anechoic chamber using a 4165 B&K microphone with a 2231 B&K amplifier, and a Sony PCM-2000 digital audio tape recorder. The speakers were seated. A distance of 30 cm from the mouth to the microphone was maintained. This geometry provides intensity sensitivity due to head movement of approximately 0.6 dB per cm change in microphone-to-mouth distance. Since we have no direct interface between the digital tape recorder and the computer, the speech was played back and redigitized at 12 kHz with a sharp-cutoff anti-alias filter set at 5.8 kHz, using an Ariel DSP-16 rev. G board, yielding 16 bits of precision. The combined system has essentially constant amplitude and phase response from 20 Hz to over 5 kHz. The signal-to-noise ratio for the digitized data was greater than 55 dB. Electroglottographic signals were recorded and digitized simultaneously on a second channel to provide a check for the acoustically determined glottal epochs which drive the analysis algorithms.
4.4 Analysis algorithms and their motivation
The most difficult part of the study was developing the phonetic characterization, and the one used is not fully successful. Given both the volume of speech to be processed and the need for replicability, it is desirable to avoid measurement procedures which involve extensive fitting by eye or other subjective judgment. Instead, we would argue the need for semi-automatic procedures, in which the speech is processed using well-defined and tested algorithms whose results are then scanned for conspicuous errors.
4.4.1 Pitch-synchronous analysis
The acoustic features used in this study are determined by "pitch-synchronous" analyses in which the start of the analysis window is phase-locked on the time of glottal closure and the duration of the window is determined by the length of the glottal cycle. Pitch-synchronous analysis is desirable because it offers the best combination of physical insight and time resolution. One glottal cycle is the minimum period of interest, since it is difficult to draw
conclusions about the laryngeal configuration from anything less. When the analysis window is matched to the cycle in both length and phase, the results are well behaved. In contrast, when analysis windows the length of a cycle are applied in arbitrary phase to the cycle, extensive signal-processing artifacts result. Therefore non-pitch-synchronous moving-window analyses are typically forced to a much longer window length in order to show well-behaved results. The longer window lengths in turn obscure the speech events, which can be extremely rapid. Pitch-synchronous analysis is feasible for segments which are voiced throughout because the point of glottal closure can be determined quite reliably from the acoustic waveform (Talkin 1989). We expected it to be applicable in our study since the regions of speech to be analyzed were designed to be entirely voiced. For subject DT, our expectations were substantially met. For subject MR, strong aspiration and glottalization in some utterances interfered with the analysis. Talkin's algorithm for determining epochs, or points of glottal closure, works as follows: speech, recorded and digitized using a system with known amplitude and phase characteristics, is amplitude- and phase-corrected and then inverse-filtered using a matched-order linear predictor to yield a rough approximation to the derivative of the glottal volume velocity (U'). The points in the U' signal corresponding to the epochs have the following relatively stable characteristics: (1) Constant polarity (negative), (2) Highest amplitude within each cycle, (3) Rapid return to zero after the extremum, (4) Periodically distributed in time, (5) Limited range of inter-peak intervals, and (6) Similar size and shape to adjacent peaks. A set of peak candidates is generated from all local maxima in the U' signal. Dynamic programming is then used to find the subset of these candidates which globally best matches the known characteristics of U' at the epoch. The algorithm has been evaluated using epochs determined independently from simultaneously recorded EGG signals and was found to be quite accurate and robust. The only errors that have an impact on the present study occur in strongly glottalized or aspirated segments.
4.4.2 Measures used
Pitch-synchronous measurements used in the current study are (1) root mean square of the speech samples in the first 3 msec following glottal closure expressed in dB re unity (RMS), (2) ratio of per-period energy in the fundamental to that in the second harmonic (HR), and (3) local standard deviation in period length (PDEV). RMS and HR are applied to /h/, and PDEV is applied to /?/. Given the relatively constant intertoken phonetic context, RMS provides
an intertoken measure closely related to the strength of the glottal excitation in corresponding segments. The integration time for RMS was held constant to minimize interactions between F0 and formant bandwidths. RMS was not a useful measure for /?/, since even a strongly articulated /?/ may have glottal excitation as strong as the neighboring vowels; the excitation is merely more irregular. RMS is relatively insensitive to epoch errors in /h/s, since epoch-location uncertainty tended to occur when energy was more evenly distributed through the period, which in turn renders the measurement point for RMS less critical. HR is computed as the ratio (expressed in dB) of the magnitudes of the first and second harmonics of an exact DFT (Discrete Fourier Transform) computed over one glottal period. The period duration is from the current epoch to the next, but the start time of the period is taken to be at the zero crossing immediately preceding the current epoch. This minimizes periodicity violations introduced by cycle-to-cycle excitation variations, since the adjusted period end will usually also fall in a low-amplitude (near zero) region of the cycle. HR is a relevant measure because the increase in open quotient of the glottal cycle and the lengthening of the time required to accomplish glottal closure associated with vocal-fold abduction tends to increase the power in the fundamental relative to the higher harmonics. This increase in the average and minimum glottal opening also changes the vocal-tract tuning and sub- to superglottal coupling. The net acoustic effect is to introduce a zero in the spectrum in the vicinity of F1 and to increase both the frequency and bandwidths of the formants, especially F1. Since our speech material was designed to keep F1 above the second harmonic, these effects all conspire to increase HR with abduction. The reader is referred to Fant and Lin (1988) for supporting mathematical derivations. Figure 4.3 illustrates the behavior of the HR over the target intervals which include the two /h/s shown in figures 4.1 and 4.2. Fant and Lin's derivations do not attempt to model the contribution of aspiration to the spectral shape, and the relation of abduction to HR indeed becomes nonmonotonic at the point at which aspiration noise becomes the dominant power source in the second-harmonic region. One of the subjects, DT, has sufficiently little aspiration during the /h/s that this nonmonotonicity did not enter substantially into the analysis, but for subject MR it was a major factor, and as a result RMS shows much clearer patterns. HR is also sensitive to serious epoch errors, rendering it inapplicable to glottalized regions. PDEV is the standard deviation of the seven glottal period lengths surrounding the current epoch. This measure represents an effort to quantify the irregular periodicity which turned out to be the chief hallmark of /?/. It was adopted after detailed examination of the productions failed to support
Figure 4.3 HR and RMS measured for each glottal period throughout the target intervals of the utterances introduced in figures 4.1 and 4.2. Note that the difference of ~34 dB between the HR in the /h/ and following vowel for the well-articulated case (top) is much greater than the ~2 dB observed in the lenited case (bottom). The RMS values discussed in the text are based on the (linear) RMS displayed in this figure
the common understanding that /?/ is produced by partial or complete adduction of the vocal folds. This view predicts that the spectrum during glottalization should display a lower HR and a less steep overall spectral roll-off than are found in typical vowels. However, examination of the EGG signal in conjunction with inverse-filtering results showed that many tokens had a large, rather than a small, open quotient and even showed aspiration noise during the most closed phase, indicating that the closure was incomplete. The predicted spectral hallmarks were not found reliably, even in utterances in which glottalization was conspicuously indicated by irregular periodicity. We surmise that DT in particular produces /?/ by bracing or tensing partially abducted vocal folds in a way which tends to create irregular vibration without a fully closed phase.
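The three pitch-synchronous measures can be summarized in a short sketch. The function names and simplifications below (frames starting exactly at the epoch sample rather than at the preceding zero crossing, no handling of noise or epoch errors) are ours, not part of the published analysis system; the definitions themselves follow the text above.

    import numpy as np

    def rms_db(speech, epoch, fs):
        # RMS of the samples in the first 3 msec following glottal closure,
        # expressed in dB re unity.
        n = int(round(0.003 * fs))
        frame = speech[epoch:epoch + n]
        return 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)))

    def hr_db(speech, period_start, period_end):
        # Ratio (dB) of the magnitudes of the first and second harmonics of an
        # exact DFT computed over one glottal period; with the frame exactly
        # one period long, bins 1 and 2 of the DFT are those harmonics.
        frame = speech[period_start:period_end]
        spectrum = np.fft.rfft(frame)
        return 20.0 * np.log10(np.abs(spectrum[1]) / np.abs(spectrum[2]))

    def pdev(epochs, i):
        # Standard deviation of the seven glottal period lengths surrounding
        # the current epoch (epochs given as sample indices).
        periods = np.diff(epochs)
        window = periods[max(0, i - 3):i + 4]
        return float(np.std(window))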
4.4.3 Validation using synthetic signals
In order to validate the measures, potential artifactual variation due to F0-F1 interactions was assessed. A six-formant cascade synthesizer excited by a
Liljencrants-Fant voice source run at 12 kHz sampling frequency was used to generate synthetic voiced-speech-like sounds. These signals were then processed using the procedures outlined above. F, and Fo were orthogonally varied over the ranges observed in the natural speech tokens. ¥ x bandwidth was held constant at 85 Hz while its frequency took on values of 500 Hz, 700 Hz and 800 Hz. For each of these settings the source fundamental was swept from 75 Hz to 150 Hz during one second with the open quotient and leakage time held constant. The bandwidths and frequencies of the higher formants were held constant at nominal 17 cm, neutral vocal-tract values. Note that the extremes in F{ and Fo were not simultaneously encountered in the natural data, so that this test should yield conservative bounds on the artifactual effects. As expected, PDEV did not vary significantly throughout the test signal. The range of variation in HR for these test signals was less than 3 dB. The maximum peak-to-valley excursion in RMS due to F o harmonics crossing the formants was 2 dB with a change in Fo from 112 Hz to 126 Hz and FY at 500 Hz. This is small compared to RMS variations observed in the natural speech tokens under study. 4.4.4 Analysis of the data Time points were established for the /m/ release, the first vowel center, the center of the /h/ or glottal stop, the center of the following vowel, and the point of oral constriction for the consonant. This was done by inspection of the waveform and broad-band spectrogram, and by listening to parts of the signal. The RMS, HR and PDEV values for the vowel were taken to be the values at the glottal epoch nearest to the measured vowel center. RMS was used to estimate the /h/ duration, since it was almost invariably lower at the center of the /h/ than during the following vowel. The /h/ interval was defined as the region around the minimum RMS observed for the /h/ during which RMS did not exceed a locally determined threshold. Taking RMS(C) as the minimum RMS observed and RMS(V2) as the maximum RMS in the following vowel, the threshold was defined as RMS(C) + 0.25*[RMS(V2) - RMS(C)]. The measure was somewhat conservative compared to a manual segmentation, and was designed to avoid spurious inclusions of the preceding schwa when it was extremely lenited. The consonantal value for RMS was taken to be the RMS minimum, and its HR value was taken to be the maximum HR during the computed /h/ interval. The PDEV value for the /?/ was taken at the estimated center, since the 102
4.5 Results
After mentioning some characteristics of the two subjects' speech, we first present results for /h/ and then make some comparisons to /?/.
4.5.1 Speaker characteristics
There were some obvious differences in the speech patterns of the two subjects. When these differences are taken into account, it is possible to discern strong underlying parallels in the effects of prosody on /h/ and /?/ production. MR had vocal fry in postnuclear position. This was noticeable both in the deaccented condition and at the end of the preposed vocative Now Emma. He had strong aspiration in /h/, leading to failure of the epoch finding in many cases and also to nonmonotonic behavior of the HR measure. As a result, the clearest patterns are found in RMS (which is less sensitive than HR to epoch errors) and in duration. In general, MR had clear articulation of consonants even in weak positions. DT had breathiness rather than fry in postnuclear position. Aspiration in /h/ was relatively weak, so that the epoch finder and the HR measure were well behaved. Consonants in weak positions were strongly reduced.
4.5.2 Effects of word prosody and phrasal stress on /h/
Both the position in the word and the phrasal stress were found to affect how /h/ was pronounced. In order to clarify the interpretation of the data, we would first like to present some schematic plots. Figure 4.4 shows a blank plot of RMS in the /h/ against RMS in the vowel. Since the /h/ articulation decreases the RMS, more /h/-like /h/s are predicted to fall towards the left of the plot while more vowel-like /h/s fall towards the right of the plot. Similarly, more /h/-like vowels are predicted to fall towards the bottom of the plot, while more vowel-like vowels should fall towards the top of the plot. The line y = x, shown as a dashed diagonal, represents the case where the /h/ and the vowel had the same measured RMS. That is, as far as RMS is concerned, there was no local contrast between the /h/ and the vowel. Note that this case, the case of complete neutralization, is represented by a wide range of values, so that the designation "complete lenition" does not actually fix the articulatory gesture. In general, we do not expect to find /h/s which are 103
Figure 4.4 A schema for interpreting the relation of RMS in the /h/ to RMS in the following vowel. Greater values of RMS correspond to more vowel-like articulations, and lesser values correspond to more /h/-like articulations. The line y = x represents the case in which the /h/ and the following vowel do not contrast in terms of RMS. Distance perpendicular to this line represents the degree of contrast between the /h/ and the vowel. Distance parallel to this line cannot be explained by gestural magnitude, but is instead attributed to a background effect on both the /h/ and the vowel. The area below and to the right of y = x is predicted to be empty
Figure 4.5 A schema for interpreting the relation of HR in the /h/ to HR in the following vowel. It has the same structure as in figure 4.4, except that greater rather than lesser values of the parameter represent more /h/-like articulations
more vocalic than the following vowel, so that the lower-right half is expected to be empty. In the upper-left half, the distance from the diagonal describes the degree of contrast between the /h/ and the vowel. Situations in which both the /h/ and the vowel are more fully produced would exhibit greater contrast, and would therefore fall further from the diagonal. Note again that a given magnitude of contrast can correspond to many different values for the /h/ and vowel RMS. Figure 4.5 shows a corresponding schema for HR relations. The structure is the same except that higher, rather than lower, x and y values correspond to more /h/-like articulations. In view of this discussion, RMS and HR data will be interpreted with respect to the diagonal directions of each plot. Distance perpendicular to the y = x line (shown as a dotted line in each plot) will be related to the strength or magnitude of the CV gesture. Location parallel to this line, on the other hand, is not related to the strength of the gesture, but rather to a background effect on which the entire gesture rides. One of the issues in interpreting the data is the linguistic source of the background effects. Figures 4.6 and 4.7 compare the RMS relations in word-initial stressed /h/, when accented in questions and when deaccented. The As are farther from the y = x line than the Ds, indicating that the magnitude of the gesture is greater when /h/ begins an accented syllable. For subject DT, the two clouds of points can be completely separated by a line parallel to y = x. Subject MR shows a greater range of variation in the D case, with the most carefully articulated Ds overlapping the gestural magnitude of the As. Figures 4.8 and 4.9 make the same comparison for word-medial /h/ preceding a weakly stressed or reduced vowel. These plots have a conspicuously different structure from figures 4.6 and 4.7. First, the As are above and to the right of the Ds, instead of above and to the left. Second, the As and Ds are not as well separated by distance from the y = x line; whereas this separation was clear for word-initial /h/s, there is at most a tendency in this direction for the medial reduced /h/s. The HR data shown for DT in figures 4.10 and 4.11 further exemplifies this contrast. Word-initial /h/ shows a large effect of accentuation on gestural magnitude. For medial reduced /h/ there is only a small effect on magnitude; however, the As and Ds are still separated due to the lower HR values during the vowel for the As. HR data is not presented for MR because strong aspiration rendered the measure a nonmonotonic function of abduction. Since the effect of accentuation differs depending on position in the word, we can see that both phrasal prosody and word prosody contribute to determining how segments are pronounced. In decomposing the effects, let us first consider the contrasts in gestural magnitude, that is perpendicular to the x = y line. In the case of hawkweed and hogfarmer, the difference between As 105
RMS in /h/ in dB A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.6 RMS in /h/ of hawkweed and hogfarmer plotted against RMS in the following vowel, when the words are accented in questions (the As) and deaccented (the Ds). The subject is DT
RMS in /h/ in dB A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.7 RMS in /h/ of hawkweed and hogfarmer plotted against RMS in the following vowel, when the words are accented in questions (the As) and deaccented (the Ds). The subject is MR
RMS in /h/ in dB A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.8 RMS in /h/ of Omaha and tomahawk plotted against RMS in the following vowel, when the words are accented in questions and deaccented. The subject is DT
RMS in /h/ in dB A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.9 RMS in /h/ of Omaha and tomahawk plotted against RMS in the following vowel, when the words are accented in questions and deaccented. The subject is MR
HR in /h/ in dB A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.10 HR in /h/ of hawkweed and hogfarmer plotted against HR in the following vowel, when the words are accented in questions and deaccented. The subject is DT
HR in /h/ in dB A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.11 HR in /h/ of Omaha and tomahawk plotted against HR in the following vowel, when the words are accented in questions and deaccented. The subject is DT
and Ds is predominantly in this direction. The Omaha and tomahawk As and Ds exhibit a small difference in this direction, though this is not the most salient feature of the plot. From this we deduce that accentuation increases gestural magnitude, making vowels more vocalic and consonants more consonantal. The extent of the effect depends on location with respect to the word prosody; the main stressed word-initial syllable inherits the strength of accentuation, so to speak, more than the medial reduced syllable does. At the same time we note that in tomahawk and Omaha, the As are shifted relative to the Ds parallel to the y = x line. That is, both the consonant and the vowel are displaced in the vocalic direction, as if the more vocalic articulation of the main stressed vowel continued into subsequent syllables. The data for tomahawk and Omaha might thus be explicated in terms of the superposition of a local effect on the magnitude of the CV gesture and a larger-scale effect which makes an entire region beginning with the accented vowel more vocalic. The present data do not permit a detailed analysis of what region is affected by the background shift in a vocalic direction. Note that the effect of a nuclear accent has abated by the time the deaccented postnuclear target words are reached, since these show a more consonantal background effect than the accented words do. In principle, data on the word mahogany would provide critical information about where the effect begins, indicating, for example, whether the shift in the vocalic direction starts at the beginning of the first vowel in an accented word or at the beginning of the stressed vowel in a foot carrying an accent. Unfortunately, the mahogany data showed considerable scatter and we are not prepared at this point to make strong claims about their characterization.
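Geometrically, the decomposition used in this section is a projection of each (consonant, vowel) pair onto axes rotated 45 degrees from the measurement axes: distance perpendicular to y = x indexes gestural magnitude, and position parallel to it indexes the background effect. A minimal sketch of that arithmetic follows; the names and the example values are ours, not the authors', and for HR the polarity of the vocalic direction reverses.

```python
import math

def decompose(cons_db, vowel_db):
    """Project a (consonant, vowel) RMS pair onto the diagonal axes of figure 4.4.

    contrast   : signed distance from the y = x line; larger values mean greater
                 CV contrast, i.e. a larger gestural magnitude.
    background : position along the y = x line; larger values mean both segments
                 are shifted in the vowel-like (higher-RMS) direction.
    """
    contrast = (vowel_db - cons_db) / math.sqrt(2.0)
    background = (vowel_db + cons_db) / math.sqrt(2.0)
    return contrast, background

# Hypothetical tokens: an accented /h/ with a full gesture versus a deaccented,
# heavily lenited /h/ whose RMS barely differs from the following vowel.
print(decompose(45.0, 60.0))
print(decompose(55.0, 56.0))
```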
4.5.3 The effect of the phrase boundary on /h/
It is well known that syllables are lengthened before intonational boundaries. Phrase-final voiced consonants are also typically devoiced. An interesting feature of our data is that it also demonstrated lengthening and suppression of voicing after an intonational boundary, even in the absence of a pause. Figures 4.12 and 4.13 compare duration and RMS in word-initial /h/ after a phrase boundary (that is, following Now Emma), with word-initial /h/ in accented but phrase-medial position, and in deaccented (also phrase-medial) position. In both plots, the "%" points are below and to the right of the A and D points, indicating a combination of greater length and less strong voicing. DT shows a strong difference between A and D points, with Ds being shorter and more voiced than As. MR shows at most a slight difference between As and Ds, reflecting his generally small degree of lenition of
Duration of /h/ in seconds % phrase boundary; A accented in questions; D deaccented
Figure 4.12 Duration vs. RMS in /h/ for hawkweed and hogfarmer when accented at a phrase boundary, accented but phrase-medial in questions, and deaccented. The subject is DT
Duration of /h/ in seconds % phrase boundary; A accented in questions; D deaccented
Figure 4.13 Duration vs. RMS in /h/ for hawkweed and hogfarmer when accented at a phrase boundary, accented but phrase-medial in questions, and deaccented. The subject is MR
consonants in weak positions. For MR, the effect of the phrase boundary is thus a more major one than the effect of accentual status. A subset of the data set, the sentences involving tomahawk, make it possible to extend the result to a nonlaryngeal consonant. The aspiration duration for the /t/ was measured in the four prosodic positions. The results are displayed in figures 4.14 and 4.15. The lines represent the total range of observations for each condition, and each individual datum is indicated with a tick. For DT, occurring at a phrase boundary approximately doubled the aspiration length, and there was no overlap between the phrase-boundary condition and the other conditions. For MR, the effect was somewhat smaller, but the overlap can still be attributed to only one point, the smallest value for the phrase-boundary condition. For both subjects, a smaller effect of accentuation status can also be noted. The effect of the phrase boundary on gestural magnitude can be investigated by plotting the RMS in the /h/ against RMS in the vowel, the wordinitial accented /h/ in phrase-initial and phrase-medial position. This comparison, shown infigures4.16 and 4.17, indicates that the gestural magnitude was greater in phrase-initial position. The main factor was lower RMS (that is, a more extreme consonantal outcome) for the /h/ in phrase-initial position; the vowels differed slightly, if at all. Returning to the decomposition in terms of gestural-magnitude effects and background effects, we would suggest that the phrase boundary triggers both a background shift in a consonantal direction (already observed in preboundary position in the "deaccented" cases) and an increase in gestural magnitude. The effect on gestural magnitude must be either immediately local to the boundary, or related to accentual strength, if deaccented words in the middle of the postnuclear region are to be exempted as observed. It is interesting to compare our results on phrase-initial articulation with Beckman, Edwards, and Fletcher's results (this volume) on phrase-final articulation. Their work has shown that stress-related lengthening is associated with an increase in the extent of jaw movement while phrase-final lengthening is not, and they interpret this result as indicating that stress triggers an underlying change in gestural magnitude while phrase-final lengthening involves a change in local tempo but not gestural magnitude. Given that our data do show an effect of phrase-initial position on gestural magnitude, their interpretation leads to the conclusion that phrase-initial and phrase-final effects are different in nature. However, we feel that the possibility of a unified treatment of phraseperipheral effects remains open, pending the resolution of several questions about the interpretation of the two experiments. First, it is possible that the gestural-magnitude effect observed in our data is an artifact of the design of the materials, since the Now Emma sentences may have seemed more unusual 111
(Conditions, top to bottom: questions, deaccented, contradiction, phrase boundary.)
Duration in seconds
Figure 4.14 Voice-onset time in /t/ of tomahawk for all four prosodic positions; subject DT. Ticks indicate individual data points
Duration in seconds
Figure 4.15 Voice-onset time in /t/ of tomahawk for all four prosodic positions; subject MR. Ticks indicate individual data points
RMS in /h/ in dB I phrase-initial; M phrase-medial; lines: y = x and y = -x + b
Figure 4.16 RMS of /h/ and final vowel for subject DT in accented phrase-initial (I) and phrase-medial (M) question contexts
RMS in /h/ in dB I phrase-initial; M phrase-medial; lines: y = x and y = -x + b
Figure 4.17 RMS of /h/ and final vowel for subject MR in accented phrase-initial (I) and phrase-medial (M) question contexts
or semantically striking than those where the target words were phrase-internal. If this is the case, semantically matched sentences would show a shift towards the consonantal direction in the vowel following the consonant, as well as the consonant itself. Second, it is possible that an effect on intended magnitude is being obscured in Beckman, Edwards, and Fletcher's (this volume) data by the nonlinear physical process whose outcome serves as an index. Possibly, jaw movement was impeded after the lips contacted for the labial consonant in their materials, so that force exerted after this point did not result in statistically significant jaw displacement. If this is the case, measurements of lip pressure or EMG (electromyography) might yield results more in line with ours. Third, it is possible that nonlinearities in the vocal-fold mechanics translate what is basically a tempo effect in phrase-initial position into a difference in the extent of the acoustic contrast. That is, it is possible that the vocal folds are no more spread for the phrase-initial /h/s than for otherwise comparable /h/s elsewhere but that maintaining the spread position for longer is in itself sufficient to result in greater suppression of the oscillation. This possibility could be evaluated using high-speed optical recording of the laryngeal movements.
4.5.4 Observations about glottalization
Although all /h/s in the study had some noticeable manifestation in the waveform, this was not the case for /?/. In some prosodic positions, glottalization for /?/ appeared quite reliably, whereas in others it did not. One might view /?/ insertion as an optional rule, whose frequency of application is determined in part by prosodic position. Alternatively, one might take the view that the /?/ is always present, but that due to the nonlinear mechanics involved in vocal-fold vibration, the characteristic irregularity only becomes apparent when the strength and/or duration of the gesture is sufficiently great. That is, the underlying control is gradient, just as for /h/, but a nonlinear physical system maps the gradient control signal into two classes of outcomes. From either viewpoint, an effect of prosodic structure on segmental production can be demonstrated; the level of representation for the effect cannot be clarified without further research on vocal-fold control and mechanics. Table 4.1 summarizes the percentage of cases in which noticeable glottalization for /?/ appeared. The columns represent phrasal prosody and the rows indicate whether the target syllable is stressed or reduced in its word. The most striking feature of the table is that the reduced, non-phrase-boundary entries are much lower than the rest, for both subjects. That is, although stressed syllables had a high percentage of noticeable /?/s in all positions, reduced syllables had a low percentage except at a phrase boundary. This
Table 4.1 Percentage of tokens with noticeable /?/

Subject   Stress     %-boundary   Accented   Deaccented
MR        stressed   100          85         100
          reduced    93           33         44
DT        stressed   90           95         80
          reduced    97           17         27
PDEV in /?/ in seconds; A accented in questions; D deaccented; % phrase boundary
Figure 4.18 PDEV for /?/ beginning August and awkwardness plotted against PDEV in the following vowel, for subject DT
result shows that word-level and phrase-level prosody interact to determine the likelihood of observed glottalization. It does not provide any information about the degree of glottalization in cases where it was equally likely. In figure 4.18, PDEV is used to make this comparison for subject DT, for syllables with word stress. Only utterances in which glottalization was observed are included. In the deaccented tokens, PDEV during /?/ was overall much lower than in the accented phrase-medial tokens or the phrase-initial tokens.
4.6 Discussion and conclusions
The experiment showed that the pronunciation of both /h/ and /?/ depends on word- and phrase-level prosody. We decompose these effects into effects on gestural magnitude and background effects. An overall shift in a vocalic direction was associated with accent, beginning at the rhyme of the accented syllable and affecting even later syllables in the same word. The phrase boundary was found to shift articulation on both sides in a more consonantal direction; related phrase-initial lengthening of the consonant, analogous to the phrase-final lengthening observed by many other researchers, was also observed. Superimposed on the background effects we observe effects on gestural magnitude related to the strength of a segment's prosodic position in the word and in the phrase. Accent affected the gestural magnitude both for main stressed and reduced syllables within the accented word, but it affected the stressed syllables more. There is also some evidence for a phraseboundary effect on gestural magnitude, although further investigation is called for. The interaction of effects on gestural magnitude and background effects is highly reminiscent of the interactions between local and large-scale effects which have proved critical for modeling the manifestations of tone in Fo contours. The effects on gestural magnitude for /h/ and /?/ are broadly analogous to the computations involved in mapping tones into Fo target levels or excursions, while the background effects are reminiscent of effects such as declination and final lowering which affect the Fo values achieved for tones in an entire region. Thus, the experimental results support a parallel treatment of segments and tones in terms of their phonological representation and the phonetic realization rules which interpret them. They argue against the view which segregates tone and intonation into a "suprasegmental" component, a view which still underlies current speech technology (Lea 1980; Allen, Hunnicutt, and Klatt 1987; Waibel 1988). This view provides for prosodic effects on Fo, intensity, and duration, but does not support the representations or rules needed to describe prosodic effects on segmental allophony of the kind observed here. Our observations about /h/ and /?/ production broadly support the ideas about phonetic representation expressed in Browman and Goldstein (1990) and Saltzman and Munhall (1989), as against the approach of the International Phonetic Association or The Sound Pattern of English (Chomsky and Halle 1968). Gradient or n-ary features on individual segments would not well represent the pattern of lenition observed here; for example, equally lenited /h/s can be pronounced differently in different positions, and equally voiced /h/s can represent different degrees of lenition in different positions. An intrinsically quantitative representation, oriented towards critical aspects 116
of articulation, appears to offer more insight than the traditional fine phonetic transcription. At the same time, the present results draw attention to the need for work on articulatory representation to include a proper treatment of hierarchical structure and its manifestations. A quantitative articulatory description will still fail to capture the multidimensional character of lenition if it handles only local phonological and phonetic properties.
Comments on chapter 4 OSAMU FUJIMURA First of all, I must express my appreciation of the careful preparation by Pierrehumbert and Talkin of the experimental material. Subtle phonetic interactions among various factors such as Fo, F p and vocal-tract constriction are carefully measured and assessed using state-of-the-art signalprocessing technology. This makes it possible to study subtle but critical effects of prosodic factors on segmental characteristics with respect to vocalsource control. In this experiment, every technical detail counts, from the way the signals are recorded to the simultaneous control of several phonological conditions. Effects of suprasegmental factors on segmental properties, particularly of syntagmatic or configurational factors, have been studied by relatively few investigators beyond qualitative or impressionistic description of allophonic variation. It is difficult to prepare systematically controlled paradigms of contrasting materials, partly because nonsense materials do not serve the purpose in this type of work, and linguistic interactions amongst factors to be controlled prohibit an orthogonal material design. Nevertheless, this work exemplifies what can be done, and why it is worth the effort. It is perhaps a typical situation of laboratory phonology. The general point this study attempts to demonstrate is that so-called "segmental" aspects of speech interact strongly with "prosodic" or "suprasegmental" factors. Paradoxically, based on the traditional concept of segment, one might call this situation "segmental effects of suprasegmental conditions." As Pierrehumbert and Talkin note, such basic concepts are being challenged. Tones as abstract entities in phonological representations manifest themselves in different fundamental frequencies. Likewise, phonemes or distinctive-feature values in lexical representations are realized with different phonetic features, such as voiced and voiceless or with and without articulatory closure, depending on the configuration (e.g. syllable- or wordinitial vs. final) and accentual situations in which the phoneme occurs. The 117
same phonetic segments, to the extent that they can be identified as such, may correspond to different phonological units. Pierrehumbert and Talkin, clarifying the line recently proposed by Pierrehumbert and Beckman (1988), use the terms 'structure' and 'content' to describe the general framework of phonological/phonetic representations of speech. The structure, in my interpretation (Fujimura 1987, 1990), is a syntagmatic frame (the skeleton) which Jakobson, Fant, and Halle (1952) roughly characterized by configurational features. The content (the melody in each autosegmental tier) was described in more detail in distinctive (inherent and prosodic) features in the same classical treatise. Among different aspects of articulatory features, Pierrehumbert and Talkin's paper deals with voice-source features, in particular with glottal consonants functioning as the initial margin of syllables in English. What is called a glottal stop is not very well understood and varies greatly. The authors interpret acoustic EGG signal characteristics to be due to braced configurations of the vocal folds. What they mean by "braced" is not clear to me. They "surmise" that the subject DT in particular produces the glottal stop by bracing or tensing partially abducted vocal folds in a way that tends to create irregular vibration without a fully closed phase. Given the current progress of our understanding of the vocal-fold vibration mechanism and its physiological control, and the existence of advanced techniques for direct and very detailed optical observation of the vocal folds, such qualitative and largely intuitive interpretation will, I hope, be replaced by solid knowledge in the near future. Recent developments in the technique of high-speed optical recording of laryngeal movement, as reported by Kiritani and his co-workers at the University of Tokyo (RILP), seem to promise a rapid growth of our knowledge in this area. A preliminary study using direct optical observation with a fiberscope (Fujimura and Sawashima 1971) revealed that variants of American English /t/ were accompanied by characteristic gestures of the false vocal folds. Physiologically, laryngeal control involves many degrees of freedom, and EGG observations, much less acoustic signals, reveal little information about specific gestural characteristics. What is considered in the sparse distinctivefeature literature about voice-source features tends to be grossly impressionistic or even simply conjectural with respect to the production-control mechanisms. The present paper raises good questions and shows the right way to carry out an instrumental study of this complex issue. Particularly in this context, Pierrehumbert and Talkin's detailed discussion of their speech material is very timely and most welcome, along with the inherent value of the very careful measurement and analysis of the acoustic-signal characteristics. This combination of phonological (particularly intonation-theoretical) competence and experimental-phonetic (particularly speech-signal engineer118
ing) expertise is a necessary condition for this type of study, even just for preparing effective utterances for examination. Incidentally, it was in these carefully selected sample sentences that the authors recently made the striking discovery that a low-tone combination of voice-source characteristics gives rise to a distinctly different spectral envelope (personal communication). One of the points of this paper that particularly attracts my attention is the apparently basic difference between the two speakers examined. In recent years, I have been impressed by observations that strong interspeaker variation exists even in what we may consider to be rather fundamental control strategies of speech production (see Vaissiere 1988 on velum movement strategies, for example). One may hypothesize that different production strategies result in the same acoustic or auditory consequence. However, I do not believe this principle explains the phenomena very well, even though in some cases it is an important principle to consider. In the case of the "glottal stop," it represents a consonantal function in the syllable onset in opposition to /h/, from a distributional point of view. Phonetically (including acoustically), however, it seems that the only way to characterize this consonantal element of the onset (initial demisyllable) is that it lacks any truly consonantal features. This is an interesting issue theoretically in view of some of the ideas related to nonlinear phonology, particularly with respect to underspecification. The phonetic implementation of unspecified features is not necessarily empty, being determined by coarticulation principles only, but can have some ad hoc processes that may vary from speaker to speaker to a large extent. In order to complete our description of linguistic specification for sound features, this issue needs much more attention and serious study. In many ways this experimental work is the first of its kind, and it may open up, together with some other pioneering work of similar nature, a new epoch in speech research. I could not agree more with Pierrehumbert and Talkin's conclusion about the need for work on articulatory representation to include a proper treatment of hierarchical structure and its manifestations. Much attention should be directed to their assertion that a quantitative articulatory description will still fail to capture the multidimensional character of lenition if it handles only the local phonological and phonetic properties. But the issue raised here is probably not limited to the notion of lenition.
Comments on chapters 3 and 4
LOUIS GOLDSTEIN
Introduction
The papers in this section, by Pierrehumbert and Talkin and by Beckman, Edwards, and Fletcher, can both be seen as addressing the same fundamental question: namely, how are the spatiotemporal characteristics of speech gestures modulated (i.e., stretched and squeezed) in different prosodic environments?* One paper examines a glottal gesture (laryngeal abduction/adduction for /h/ - Pierrehumbert and Talkin), the other an oral gesture (labial closure/opening for /p/ - Beckman, Edwards, and Fletcher). The results are similar for the different classes of gestures, even though differences in methods (acoustic analysis vs. articulator tracking) and materials make a point-by-point comparison impossible. In general, the studies find that phrasal accent increases the magnitude of a gesture, in both space and time, while phrasal boundaries increase the duration of a gesture without a concomitant spatial change. This striking similarity across gestures that employ anatomically distinct (and physiologically very different) structures argues that general principles are at work here. This similarity (and its implications) are the focus of my remarks. I will first present additional evidence showing the generality of prosodic effects across gesture type. Second, I will examine the oral gestures in more detail, asking how the prosodic effects are distributed across the multiple articulators whose motions contribute to an oral constriction. Finally, I will ask whether we yet have an adequate understanding of the general principles involved.
*This work was supported by NSF grant BNS 8820099 and NIH grants HD-01994 and HD13617 to Haskins Laboratories.
Generality of prosodic effects across gesture type
The papers under discussion show systematic effects of phrasal prosodic variables that cut across gesture type (oral/laryngeal). This extends the parallelism between oral and laryngeal gestures that was demonstrated for word stress by Munhall, Ostry, and Parush (1985). In their study, talkers produced the utterance /kakak/, with stress on either the first or second syllable. Tongue-lowering and laryngeal-adduction gestures for the intervocalic /k/ were measured using pulsed ultrasound. The same effects were observed for the two gesture types: words with second-syllable stress showed larger gestures with longer durations. In addition, their analyses showed that
the two gesture types had the same velocity profile, a mathematical characterization of the shape of curve showing how velocity varies over time in the course of the gestures. On the basis of this identity of velocity profiles, the authors conclude that "the tongue and vocal folds share common principles of control" (1985: 468). Glottal gestures involving laryngeal abduction and adduction may occur with a coordinated oral-consonant gesture, as in the case of the /k/s analyzed by Munhall, Ostry, and Parush, or without such an oral gesture, as in the /h/s analyzed by Pierrehumbert and Talkin. It would be interesting to investigate whether the prosodic influences on laryngeal gestures show the same patterns in these two cases. There is at least one reason to think that they might behave differently, due to the differing aerodynamic and acoustic consequences. Kingston (1990) has argued that the temporal coordination of laryngeal and oral gestures could be more tightly constrained when the oral gesture is an obstruent than when it is a sonorant, because there are critical aerodynamic consequences of the glottal gesture in obstruents (allowing generation of release bursts and frication). By the same logic, we might expect the size (in time and space) of a laryngeal gesture to be relatively more constrained when it is coordinated with an oral-obstruent gesture than when it is not (as in /h/). The size (and timing) of a laryngeal gesture coordinated with an oral closure will determine the stop's voice-onset time (VOT), and therefore whether it is perceived as aspirated or not, while there are no comparable consequences in the case of /h/. On the other hand, these differences may prove to be irrelevant to the prosodic effects. In order to test whether there are differences in the behavior of the laryngeal gesture in these two cases, I compared the word-level prosodic effects in Pierrehumbert and Talkin's /h/ data (some that were discussed by the authors and others that I estimated from their graphs) with the data of a recent experiment by Cooper (forthcoming). Cooper had subjects produce trisyllabic words with varying stress patterns (e.g. percolate, passionate, Pandora, permissive, Pekingese), and then reproduce the prosodic pattern on a repetitive /pipipip/ sequence. The glottal gestures in these nonsense words were measured by means of transillumination. I was able to make three comparisons between Cooper and Pierrehumbert and Talkin, all of which showed that the effects generalized over the presence or absence of a coordinated oral gesture. (1) There is very little difference in gestural magnitude between word-initial and wordmedial positions for a stressed /h/, hawkweed vs. mahogany. (2) There is, however, a word-position effect for unstressed syllables (hibachi shows a larger gesture than Omaha or tomahawk). (3) The laryngeal gesture in word-initial position is longer when that syllable is stressed than when it 121
is unstressed (hawkweed vs. hibachi). All of these effects can be seen in Cooper's data. In addition, Cooper's data show a very large reduction of the laryngeal gesture in a reduced syllable immediately following the stressed vowel (in the second /p/ of utterances modeled after percolate and passionate). In many cases, no laryngeal spreading was observable at all. While this environment was not investigated by Pierrehumbert and Talkin, they note that such /h/s have been considered by phonologists as being deleted altogether (e.g. vehicle vs. vehicular). The coincidence of these effects is again striking. Moreover, this is an environment where oral gestures may also be severely reduced: tongue-tip closure gestures reduce to flaps (Kahn 1976). Thus, there is strong parallelism between prosodic effects on laryngeal gestures for /h/ and on those that are coordinated with oral stops. This similarity is particularly impressive in face of the very different acoustic consequences of laryngeal gesture in the two cases: generation of breathy voice (/h/) and generation of voiceless intervals. It would seem, therefore, that it is the gestural dynamics themselves that are being directly modulated by stress and position, rather than the output variables such as VOT. The changes can be stated most generally at the level of gestural kinematics and/or dynamics. Articulatory locus of prosodic effects for oral gestures
The oral gestures analyzed by Beckman, Edwards, and Fletcher are bilabial closures and openings into the following vowel. Bilabial closures are achieved by coordinated action of three separate articulatory degrees of freedom: jaw displacement, displacement of the lower lip with respect to the jaw, and displacement of the upper lip with respect to the upper teeth. The goal of bilabial closure can be defined in terms of the vertical distance between the upper and lower lips, which needs to be reduced to zero (or to a negative value, indicating lip compression). This distance has been shown to be relatively invariant for a bilabial stop produced in different vowel contexts (Sussman, MacNeilage, and Hanson 1973: Macchi 1988), while the contributions of the individual articulators vary systematically - high-vowel contexts show both higher jaw positions and less displacement of the lower lip with respect to the jaw than is found in low-vowel contexts. Tokento-token variation shows a similarly systematic pattern (Gracco and Abbs 1986). This vertical interlip distance, or lip aperture, is an example of what we call "vocal-tract variables" within the computational gestural model being developed at Haskins Laboratories (Browman and Goldstein 1986, 1990; Saltzman 1986; Saltzman et al. 1988a). Gestures are the primitive 122
phonological units; each is modeled as a dynamical system, or control regime, whose spatial goals are defined in terms of tract variables such as these. When a given gesture is active, the individual articulatory components that can contribute to a given tract variable constitute a "coordinative structure" (Kelso, Saltzman, and Tuller, 1986) and cooperate to achieve the tract-variable goal. Individual articulators may compensate for one another, when one is mechanically restrained or involved in another concurrent speech gesture. In this fashion, articulatory differences in different contexts are modeled (Saltzman 1986). With respect to the prosodic effects discussed by Beckman, Edwards, and Fletcher, it is important to know whether the gesture's tract-variable goals are being modified, or rather if some individual articulator's motions are being amplified or reduced, and if so, whether these changes are compensated for by other articulators. Since Beckman, Edwards, and Fletcher present data only for the jaw, the question cannot be answered directly. However, Macchi (1988) has attempted to answer this question using data similar to the type employed by them. Macchi finds that prosodic variation (stress, syllable structure) influences primarily the activity of the jaw, but that, unlike variation introduced by vowel environment, the prosodic effects on the jaw are not compensated for by the displacement of the lower lip with respect to the jaw. The displacement remains invariant across prosodic contexts. Thus, the position of the lower lip in space does vary as a function of prosodic environment, but this variation is caused almost exclusively by jaw differences. That is, the locus of the prosodic effects is the jaw. Unifying oral and laryngeal effects
If Macchi's analysis is correct (and generalizes beyond the speakers and utterances analyzed), it fits well with Beckman, Edwards, and Fletcher's characterization of prosodic effects in terms of a sonority variable. It suggests that it would be possible to develop a model in which "segmental" structure is modeled by gestures defined in the space of tract variables, such as Lip Aperture, while prosodic effects would be modeled by long-term "gestures" defined in the space of Sonority, which would be related directly to jaw position. The association between the jaw height and sonority (or vocal-tract "openness") is an attractive one (although Keating [1983] presents some problems with using it as a basis for explaining why segments order according to the sonority hierarchy). A major problem with this view emerges, however, if we return to the 123
laryngeal data. Here, as we have seen, the effects parallel those for oral gestures, in terms of changes in gesture magnitude and duration. Yet it would be hard to include the laryngeal effects under the rubric of sonority, at least as traditionally defined. Phrasal accent results in a more open glottis, but a more open glottis would result in a less sonorous output. Thus the sonority analysis fails to account, in a unified way, for the parallel prosodic modulations of the laryngeal and oral gestures. An alternative analysis would be to examine the effects in terms of the overall amount of energy expended by the gestures, accented syllables being more energetic. However, this would not explain why the effects on oral gestures seem to be restricted to the jaw (which is explained by the sonority account). Finding a unified account of laryngeal and oral effects remains an exciting challenge.
Comments on chapters 3 and 4
IRENE VOGEL
Prosodic structure
Chapters 3 and 4, in common with those in the prosody section of this volume, all view the structure of phonology as consisting of hierarchically arranged phonological, or prosodic, constituents.* The phonetic phenomena under investigation, furthermore, are shown to depend crucially on such constituents in the correct characterization of their domains of application. Of particular importance is the position of specific elements within the various constituents. As Pierrehumbert and Talkin suggest, "phonetic realization rules in general can be sensitive to prosodic structure," whether they deal with tonal, segmental, or, presumably, durational phenomena. In fact, a large part of the phonology-phonetics interface seems to involve precisely the matching up of the hierarchical structures - the phonology - and the physical realizations of the specific tonal, segmental, and durational phenomena - the phonetics. This issue - the role of phonological constituents in phonetics implementation - leads directly to the next point: precisely, what are the phonological constituents that are relevant for phonetics?
*This was presented at the conference as a commentary on several papers, but because of the organization of the volume appears here with chapters 3 and 4.
A common view of phonology
groups speech sounds into the following set of constituents (from the word up):
(1) phonological utterance
    intonational phrase
    phonological phrase
    clitic group
    phonological word
Phonological constituents referred to in this volume, however, include the following:
(2) (a) Pierrehumbert and Talkin: (intonational) phrase, (phonological) word
    (b) Beckman, Edwards, and Fletcher: (intonation) phrase, (phonological word)
    (c) Kubozono: major phrase, minor phrase
    (d) van den Berg, Gussenhoven, and Rietveld: association domain, association domain'
Given such an array of proposed phonological constituents, it is important to stop and ask a number of basic questions. First of all, do we expect any, or possibly all, of the various levels of phonological structure to be universal? If not, we run the risk of circularity: a phenomenon P in some language is found to apply within certain types of strings which we thus define as a phonological constituent C; C is then claimed to be motivated as a phonological constituent of the language because it is the domain of application of P. In so doing, however, we lose any predictive power phonological constituents may have, not to mention the fact that we potentially admit an infinite number of language types in terms of their phonological constituent structure. It would thus be preferable to claim that there is some finite, independently motivated, universal set of phonological constituents. But what are these? The constituents in (1) were originally proposed and motivated as such primarily on the basis of phonological rules (e.g. Selkirk 1978; Nespor and Vogel 1986). In various papers in this volume we find phonetic data arguing in favor of phonological constituents, but with some different names. Pierrehumbert and Talkin as well as Beckman, Edwards, and Fletcher assume essentially the structures in (1), though they do not state what
definitions they are using for their constituents. In Kubozono's paper, however, we find major phrase and minor phrase, and, since neither is explicitly defined, we do not know how they relate to the other proposed constituents. Similarly, van den Berg, Gussenhoven, and Rietveld explicitly claim that their association domain and association domain' do not coincide with any phonological constituents proposed elsewhere in the literature. Does this mean we are, in fact, adopting the position that essentially anything goes, where we just create phonological constituent structure as we need it? Given the impressive cross-linguistic insights that have been gained in phonology by identifying a small finite set of prosodic constituents along the lines of (1), it would be surprising, and dismaying, if phonetic investigation yielded significantly different results. Phonology and phonetics
It might be said that anything that is rule-governed and thus predictable is part of competence and should therefore be considered phonology. If this is so, one could also argue that "phonetic implementation rules" are phonology since they, too, follow rule-governed patterns (expressed, for example, in parametric models). Is phonetics, then, just the "mechanical" part of the picture? This is probably too extreme a position to defend, but it is not clear how and where exactly we are to draw the line between what is phonological and what is phonetic. One of the stock (if simplified) answers to this question is something like the following: phonology deals with unique, idealized representations of speech, while phonetics deals with their actual manifestations. Since in theory infinite variation is possible for any idealized phonological representation, a question arises as to how we know whether particular variations are acceptable for a given phenomenon. Which variations do we consider in our research and which may/must we exclude? Some of the papers in this volume report that it was necessary to set aside certain speakers and/or data because the speakers were unable to produce the necessary phenomena. This is all the more surprising since the data were collected in controlled laboratory settings where we would expect there to be less variation that usual. Furthermore, in more than one case, it seems that the data found to be most reliable and crucial to the study were produced by someone involved in the research. This is not meant necessarily as a methodological criticism, since much can be gained by examining constrained sets of data. It does, however, raise serious questions about interpretation of the results. Moreover, if in phonetic analyses, too, we have to pull back from the reality of variation, we blur the distinction between phonology and phonetics, since abstraction and idealization may no longer be considered 126
defining characteristics of phonology. Of course, if the goal of phonetics is to model specific phenomena, as is often the case, we do need to end up with abstractions again. We must still ask, though, what is actually being modeled when this model itself is based on such limited sets of data.
5 On types of coarticulation
NIGEL HEWLETT and LINDA SHOCKEY
5.1 Introduction
With few exceptions, both phonetics and phonology have used the "segment" as a basis of analysis in the past few decades.* The phonological segment has customarily been equated with a phonemic unit, and the nature of the phonetic segment has been different for different applications. Coarticulation has, on the whole, been modeled as intersegmental influences, most frequently for adjacent segments but also at a greater distance, and the domain of coarticulatory interactions has been assumed to be controlled by cognitive (language-specific, structural, phonological) factors as well as by purely mechanical ones. As such, the concept of coarticulation has been a useful conceptual device for explaining the gap between unitary phonological representations and variable phonetic realizations. Less segment-based views of phonological representation and of speech production/perception have led to different ways of talking about coarticulation (see, for example, Browman and Goldstein, this volume). We are concerned here with examining some segment-based definitions of coarticulation, asking some questions about the coherence of these definitions, and offering some experimental evidence for reevaluation of terminology.
*We are grateful to Peter Ladefoged, John Ohala, and Christine Shadle for their comments and advice about this work. We also thank Colin Watson of Queen Margaret College for his work in designing and building analysis equipment.
5.1.1 Ways of measuring coarticulation
Previous studies show a general agreement that coarticulation consists of intersegmental influences. However, different studies have used different
approaches to measuring these influences. We will compare two of these approaches. One could be called "testing for allophone separation." A number of studies (Turnbaugh et al. 1985; Repp 1986; Sereno and Lieberman 1987; Sereno et al. 1987; Hewlett 1988; Nittrouer et al. 1989) have examined spectral characteristics of the consonant in CV syllables by adults and children with a view to estimating the amount of influence of the vowel on the realization of the consonant (typically a voiceless stop), i.e. to determine whether and to what extent realizations of the same consonant phoneme were spectrally distinct before different vowels. The precise techniques are not at issue here. Some have involved simply measuring the frequency of the tallest peak in the spectrum or identifying whether the vowel F2 is anticipated in the consonant spectrum, others have used more elaborate techniques. A recent and extensive investigation of this type is that of Nittrouer, StuddertKennedy, and McGowan (1989). In this study, the centroids of /s/ spectra in /si/ and /su/ syllables, as spoken by young children and adults, were calculated and compared. The authors conclude that the fricative spectrum is more highly influenced by a following vowel in the children's productions than in the adults'. Another approach could be termed "finding the target-locus relationship" (Lindblom and Lindgren 1985; Krull 1987, 1989). According to this criterion, the difference between F2 at vowel onset (the "locus") and F2 at midvowel (the "target") is inversely related to the amount of CV coarticulation present. The reasoning here is that coarticulation is a process which makes two adjacent sounds more similar to each other, so if it can be shown that there is a shallower transition between consonant and vowel in one condition than in another, then that condition can be thought of as more coarticulated. This approach has been applied to a comparison of more careful and less careful speech styles and the findings have indicated greater coarticulation associated with a decrease in care. It is difficult to use the first approach (measuring allophone separation) in dealing with connected speech because of yet another phenomenon, which is often termed "coarticulation": that of vowel centralization. It is well known that on average the vowel space used in conversation is much reduced from that found for citation-form speech. Sharf and Ohde (1981) distinguish two types of coarticulation which they call "feature spreading" and "feature reduction," the latter of which includes vowel centralization. If one wants to compare coarticulation in citation form with coarticulation in connected speech by testing for allophone separation, one's task is very much complicated by the fact that two variables are involved: difference in phonological identity of the vowel and difference in overall vowel-space shape. The same objection cannot be made to the method of finding the target-locus relationship because it is insensitive to the difference 129
between (for example) a "change up" in target and a "change down" in the locus. As far as we know, no previous study has attempted to determine whether the application of the two approaches to the same data leads to a similar conclusion concerning the relative degree of coarticulation in each case.
5.2 The experiment
5.2.1 Method The study described here was designed to test whether there is a difference in degree of coarticulation in CV syllables between (1) very carefully spoken (citation-form) syllables and (2) the same syllables produced in connected (read) speech. One speaker's productions of /k/ and /t/ before the vowels /i/ and /u/ were investigated, /k/ was chosen because its allophonic variation before front and back vowels is well established. The subject was a linguistically naive male speaker in his thirties who spoke with a Received Pronunciation (RP) accent, though a long-term resident of Edinburgh. In younger speakers of RP, the vowel /u/ is often fronted: this subject may show an exaggeration of this trend as a result of his residence in Scotland, where /u/ is notably fronted phonetically ([«]). A passage of text was composed in such a way as to contain eight words beginning with the sequences: /ki/, /ku/, /ti/, /tu/; a total of thirty-two words. This text is reproduced as the appendix to this chapter (p. 137), with the experimental words in bold print. All words are either monosyllabic or, where they are disyllabic, the relevant CV sequence occurs in an initial stressed syllable. The items were chosen so as to be sufficiently contentful not to undergo vowel reduction in connected speech. The subject was asked to read through the text twice, with no special instructions being given concerning style. He had no knowledge of the purpose of the experiment or the identity of the experimental words. He then pronounced the words key, coo, tea, and two sixteen times each, in randomized order and at an interval of about 5 seconds, as cued by the words written on index cards. He was asked to say the words "very carefully". The session thus yielded sixteen each of what will subsequently be called "citation forms" and "reading forms" of each of the CV forms cited above. These were recorded using a high-quality digital cassette recorder. Digitized waveforms of the experimental items were stored on a Hewlett-Packard Vectra computer, using the A-to-D facility of a Kay Digital Sonagraph, with a sampling rate of 25,600 samples per second. Fourier spectra were obtained of the initial part of each consonant burst, using a 20 msec, time frame with a Gaussian window. The frequency of the most prominent peak was measured 130
Table 5.1 Mean frequencies in Hz of the most prominent peak of consonant burst, second formant of the vowel at onset and second formant of the vowel at steady state (standard deviations in parentheses)

                   Citation                    Reading
                   [ki]          [ku]          [ki]          [ku]
Burst peak         3,046 (248)   2,020 (160)   2,536 (167)   1,856 (93)
F2 vowel onset     2,226 (90)    1,648 (228)   2,083 (138)   1,579 (227)
F2 steady state    2,375 (131)   1,403 (329)   2,127 (112)   1,585 (239)
[Figure 5.1 Means and standard deviations of velar bursts. Panel title: "Burst frequencies for velars"; y-axis: frequency in Hz; x-axis: ki (citation), ki (reading), ku (citation), ku (reading).]
in each case. Where there was more than one very prominent peak, spectra of subsequent time frames were examined in order to determine which pole was most robust overall. The frequencies of the first three formants of each vowel were measured at onset and in the middle, using the continuous spectrogram facility of a Spectrophonics system, which produces traditional broad-band spectral displays on a CRT (cathode ray tube). Where frequencies were in doubt due 131
Table 5.2 Mean frequencies in Hz of the most prominent peak of consonant burst, second formant of the vowel at onset and second formant of the vowel at steady state (standard deviations in parentheses)

                      Citation                      Reading
                      [ti]          [tu]            [ti]          [tu]
Burst peak            6,264 (714)   6,234 (905)     4,780 (705)   4,591 (848)
F2 at vowel onset     2,104 (140)   1,666 (63)      2,012 (95)    1,635 (169)
F2 at steady state    2,320 (104)   1,517 (155)     2,075 (114)   1,596 (226)

[Figure 5.2 Means and standard deviations of alveolar bursts. Panel title: "Burst frequencies for alveolars"; y-axis: frequency in Hz; x-axis: ti (citation), ti (reading), tu (citation), tu (reading).]
to there being two formants close together or an unusually weak formant, spectral sections were used, since they allow for finer discrimination of formant amplitude. 5.2.2 Results The /k/ spectra were well differentiated in shape from the /t/ spectra, regardless of the style or following vowel, and in the expected fashion. That 132
is, /k/ spectra were characterized by a prominent peak at mid-frequency and /t/ spectra by a concentration at higher frequencies. This can be seen in the frequencies of the most prominent peak of the burst spectra, which ranged from 1,656 to 3,500 Hz for /k/ (mean = 2,364) and from 3,437 to 7,703 Hz for /t/ (mean = 5,467). Table 5.1 gives the means and standard deviations of the most prominent peak of the burst spectra, the second formant of the vowel at its onset, and the second formant of the middle of the vowel for each of the four experimental conditions, for the velar stops. Figure 5.1 shows a graphic representation of the means and standard deviations of the burst frequencies for all velars. Table 5.2 and figure 5.2 give the same information for the alveolars. The /k/ spectra reveal both a strong vowel effect and a strong effect from speech style: in all cases the /k(u)/ spectra have a lower frequency peak than the analogous /k(i)/ spectra. The /k(i)/ and /k(u)/ spectra were widely separated by the measure of burst peak frequency in the citation forms. They were also separated, to a lesser extent, in the reading forms. The /k(i)/ reading forms had a lower burst peak frequency than in the citation forms, as predicted for a situation of lesser coarticulation. The /k(u)/ reading forms, however, also had a lower burst peak frequency compared to the /k(u)/ citation forms: in this case, the difference of burst peak frequency was in the opposite direction to the difference of the F2 at the middle of the vowel (see below). A T-test revealed that the /k/ forms were significantly different from each other as regards both phonetic environment and style. The /k/ releases were also highly significantly different from the /t/ releases. The /t/ spectra show no significant effect from the following vowel, but do reveal the same strong effect from speech style: both /t(i)/ and /t(u)/ have a lower burst peak frequency in the reading mode than in the citation mode. Thus all CV forms showed a lowering of burst frequency in the reading forms. The citation-form /t/ releases proved significantly different from the reading /t/ releases when a T-test was applied, but within conditions the differences were not significant, i.e. the vocalic environment did not figure statistically as a differentiator. (See table 5.3 for T-test values.) With regard to the vowels, centralization was observed in reading mode, as can be seen in figure 5.3. Average formant frequency at vowel onset, however, showed a different pattern, with F2 being lower in all cases for the reading forms than is seen for citation forms. In this respect, the vowel onset reflects the pattern seen at consonant release. Figure 5.4 shows relative positions of averaged tokens in F,-F 2 space at vowel onset.
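For readers who wish to replicate this kind of measurement, the burst-peak measure reported above can be approximated with standard signal-processing tools. The sketch below is our illustration only, not the analysis software used in the study: the 20 msec Gaussian-windowed spectrum and the 25,600 Hz sampling rate follow section 5.2.1, but the window parameter, the use of an FFT, and all function and variable names are assumptions.

# Minimal sketch (ours, not the authors' procedure) of the burst-peak measure in
# section 5.2.1: take a 20 msec Gaussian-windowed frame at the burst onset,
# compute its Fourier spectrum, and report the frequency of the tallest peak.
import numpy as np
from scipy.signal.windows import gaussian

def burst_peak_hz(waveform, burst_onset_sample, fs=25600, frame_ms=20.0):
    n = int(fs * frame_ms / 1000)                      # 20 msec frame (512 samples at 25,600 Hz)
    frame = np.asarray(waveform[burst_onset_sample:burst_onset_sample + n], dtype=float)
    frame *= gaussian(len(frame), std=len(frame) / 6)  # Gaussian window; std value is an assumption
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return freqs[int(np.argmax(spectrum))]             # frequency of the most prominent peak

Where two peaks are of comparable amplitude, the procedure described in the text - inspecting spectra of subsequent time frames to see which pole is most robust - would still have to be applied by hand.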
Table 5.3 T-test results for comparisons described in column 1 (df = 75 throughout; * = probability less than 0.001)

Variable (release from)              T-value
citation [ki] / citation [ku]          13.7   *
read [ki] / read [ku]                  16.1   *
citation [ki] / read [ki]               7.7   *
citation [ku] / read [ku]               4.2   *
citation [ti] / citation [tu]          -0.03
read [ti] / read [tu]                   0.63
citation [ti] / read [ti]               4.9   *
citation [tu] / read [tu]               5.8   *
citation [ki] / citation [ti]         -17.2   *
read [ki] / read [ti]                 -12.3   *
citation [ku] / citation [tu]         -19.3   *
read [ku] / read [tu]                 -12.4   *

[Figure 5.3 Formant frequencies at middle of vowel: averaged citation and reading tokens of ki, ku, ti, tu plotted in F1-F2 space (axes in kHz).]

[Figure 5.4 Formant frequencies at vowel onset: averaged citation and reading tokens of ki, ku, ti, tu plotted in F1-F2 space (axes in kHz).]
5.3 Two types of coarticulation?
Testing for allophone separation gives us a negative or marginal result based on these data: the major burst peaks for the two variants of/k/ are separated by 1,026 Hz in very careful speech and by 780 Hz in read speech, so they could be said to show less, rather than more, coarticulation in the connected speech. However, standard deviations are very large relative to the size of the effect and we have verified that we have vowel centralization and therefore know that the vowel targets are also closer together in the connected speech. It would therefore be difficult to make a convincing case for any significant difference between the /k/ spectra in the two styles. In addition, we find contradictory results for /t/: allophone separation is less in citation form than in read speech (citation-form difference = 30 Hz, read-speech difference = 209 Hz), even though the vowel centralization is about the same in the two read-speech corpora. Allophone separation criteria do not, in this case, give us a satisfactory basis for claiming a difference in coarticulation in the two styles. When we use locus-target difference as a criterion, however, the results give us a completely different picture. The difference between F2 frequency at vowel onset and vowel steady state is consistently much greater for citationform speech than for reading, as can be seen from table 5.4. Based on this criterion, there is consistently more coarticulation in connected speech. 135
Table 5.4 Difference between the frequencies of F2 of the vowel steady state and vowel onset (Hz)

                 [ki]    [ku]    [ti]    [tu]
Citation form    +149    -245    +216    -149
Read form         +44      +6     +63     -39
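For concreteness, the locus-target criterion reduces to a single subtraction per form: F2 at the vowel steady state minus F2 at the vowel onset. The sketch below is our illustration only (the paper performs no such computation in code); the input values are the mean F2 frequencies already reported in tables 5.1 and 5.2, and all names are ours.

# The locus-target measure of table 5.4: F2 at vowel steady state minus F2 at
# vowel onset. A flatter CV transition (smaller difference) is read as more
# coarticulation. Mean F2 values below are taken from tables 5.1 and 5.2.
def locus_target_difference(f2_onset_hz, f2_steady_hz):
    return f2_steady_hz - f2_onset_hz

means = {   # (F2 at vowel onset, F2 at steady state), in Hz
    ("citation", "ki"): (2226, 2375), ("read", "ki"): (2083, 2127),
    ("citation", "ku"): (1648, 1403), ("read", "ku"): (1579, 1585),
    ("citation", "ti"): (2104, 2320), ("read", "ti"): (2012, 2075),
    ("citation", "tu"): (1666, 1517), ("read", "tu"): (1635, 1596),
}
for (style, syllable), (onset, steady) in means.items():
    print(style, syllable, locus_target_difference(onset, steady))
# Reproduces the entries of table 5.4, e.g. +149 for citation [ki] and +44 for read [ki].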
The question which immediately arises is: how can reading and citation forms be simultaneously not different and different with respect to coarticulation? A similar anomalous situation can be adduced from the literature on acquisition by children: studies such as that by Nittrouer et al. (1989), based on allophone separation, suggest that greater coarticulation is found in earlier stages of language development and that coarticulation decreases with age. Kent (1983), however, points to the fact that if greater coarticulation is associated with greater speed and fluency of production, it would be liable to increase with greater motor skill and hence with the age of children. He observes that this is compatible with the finding that children's speech is slower than adults', and he offers as evidence the fact that the F 2 trajectory in sequences such as /oju/ (in the phrase We saw youl) is flatter in adults' pronunciation than in children's, indicating greater coarticulation in adult speech. The criterion used in this argument is comparable to that of the target-locus relation. These contradictory findings suggest the possibility that F2 trajectories and allophone separation are really measuring two different kinds of coarticulation which behave differently in relation to phonetic development as well as in relation to speech style. Pertinent to this point is a fact which is well known but often not mentioned in discussions of coarticulation: very carefully articulated speech manifests a great deal of allophone separation. Striking examples are frequently seen to come from items pronounced in isolation or in frame sentences (e.g. Daniloff, Shuckers, and Feth 1980: 328-33). Further evidence of a large amount of allophone separation in careful speech comes from the search for acoustic invariance (Blumstein and Stevens 1979). The extent of the differences in the spectra of velar consonants in particular before different vowels testifies to the amount of this type of coarticulation in maximally differentiated tokens. Whatever sort of coarticulation may be measured by allophone separation, it seems very unlikely to be the sort which is increased by increases in speech rate or in degree of casualness. We assume it to reflect local variation in lip/tongue/jaw movement. As such, it could be easily accommodated in the gestural framework suggested by Browman and Goldstein (1986, this volume). 136
Our results show the characteristic vowel centralization which is normally attributed to connected speech. We have found, in addition, another quite general effect which might, if found to be robust, come to be considered coarticulation of the same sort as vowel centralization; this is the marked lowering of burst frequencies in read speech. The cause of this lowering has yet to be discovered experimentally. Suggested reasons are: (1) that in connected speech /ki/ and /ku/ are liable to be produced with a more open jaw and therefore a larger cavity in front of the release, which would have the effect of lowering the frequency (Ladefoged, personal communication); and (2) that since there is probably greater energy expended in the production of citation forms, it is very likely that there is a greater volume velocity of airflow (Ohala, personal communication). Given a higher rate of flow passing through approximately the same size aperture, the result is a higher center frequency in the source spectrum of the burst. These explanations are not incompatible and both are amenable to experimental investigation. It is quite likely that vowel centralization, lowering of burst frequencies, and flattening of locus-target trajectories in connected speech are all parts of the same type of long-term effect (which we may or may not want to term "coarticulation") which can be attributed to larger mandible opening combined with smaller mandible movements. It certainly appears in our data that lowered vowel-onset frequencies (which are directly linked to lowered burst frequencies) and vowel centralization conspire to produce flatter trajectories, but this hypothesis requires further investigation. Such longterm effects would be more difficult to explain using a gestural model, but could possibly be described as a style-sensitive overall weighting on articulator movement. How style can, in practice, be used to determine this weighting is an open question. Our results and the others discussed above support Sharf and Ohde's (1981) notion that it is fruitful to divide what is currently called "coarticulation" into at least two separate areas of study: relatively short-term effects and longer-term settings. In addition, our results suggest that the former may not be much influenced by differences in style while the latter show strong style effects.
Appendix: text of read story "Just relax," said Coutts. "Now we can have a proper chat." "To begin with," he went on coolly, "I don't appreciate the intruders." The gun that Coutts was holding was a .32 Smith and Wesson, Keith noted. It was its function rather than its make which mattered, of course, but noting 137
such details helped to steady his nerves. To the same end, he studied the titles of the books that were propping the broken sash window open at the bottom, providing a welcome draught of air into the room. The Collected Poems of T.S. Eliot were squashed uncomfortably between A Tale of Two Cities and Teach Yourself Icelandic.
"Perhaps you'd like to tell me the purpose of this unexpected visit?" Courts smiled, but only with his teeth. The eyes above them remained perfectly expressionless. "You know already. I want my client's document back. And the key that goes with it." "Now what key would that be, I wonder?" "About two inches long, grey in colour and with BT5024 stamped on the barrel." "Oh, that key." "So, you know about this too," Courts mused. "Well, we've got lots of time to talk about it. As it happens, I'm willing to devote the rest of my afternoon to your client's little problem." He laughed again, a proper laugh this time, which revealed a badly chipped tooth. That might have been Stella's handiwork of the previous day with the teapot. There was a photograph of her which was lying on its back on the highly polished teak surface of the desk. Next to it was another photograph, of a teenage boy who didn't seem to bear any resemblance to Courts.
"I'm not too keen on spending the rest of the afternoon cooped up in your pokey little office," said Keith. He tried to think of something more interesting to say, something that was guaranteed to keep Court's attention directed towards him. For just along from the Collected Poems of T.S. Eliot, a face was peering through the gap at the bottom of the window. This was somewhat surprising, since to the best of his recollection they were two floors up.
Comments on chapter 5 WILLIAM G. BARRY and SARAH HAWKINS Hewlett and Shockey are concerned with coarticulation from three different angles. Firstly, they present data from an experiment designed to allow the comparison of single-syllable CV utterances with the same CV sequences produced in a continuous-speech passage. The question is whether the degree of coarticulation in these sequences differs between speech-production conditions. Secondly, they are concerned with the methodological question 138
of how to quantitatively assess degree of coarticulation. Thirdly, following their title, they suggest different types of coarticulation based on different speech-production factors underlying the measured coarticulatory phenomena. The simultaneous concern with three aspects of coarticulation studies is very illuminating. The application for the first time of two different ways of quantifying CV coarticulation, using the same speech data, provides a clear illustration of the extent to which theoretical models can be the product of a particular methodological approach. Given this dependency, however, it is rather a bold step to conclude the discussion by postulating two categories of coarticulation, neatly differentiated by analytic method. This question mark over the conclusions results more from the structure of the material on which the discussion of coarticulation types is based than from the interpretation of the data. The contradictory trends found by the two analysis methods for the two production conditions might be related to separate aspects of production, namely a "local" tongue/lip/jaw effect and a "long-term" effect due to jaw movement. This is an interesting way of looking at the data, although we agree with Hewlett and Shockey's comment that the latter effect may be an articulatory setting rather than due to coarticulation. But in order to place these observations into a wider theoretical perspective, a discussion of types of coarticulation is needed to avoid post hoc definitions of type along the divisions of analytic method. We discuss two aspects of (co)articulation that are particularly relevant to Hewlett and Shockey's work:first,the types of articulatory processes that are involved in coarticulation; and second, the domains coarticulation can operate over. In an early experimental investigation of speech, Menzerath and De Lacerda (1933) distinguish "Koartikulation," involving the preparatory or perseveratory activity of an articulator not primarily involved in the current segment, and "Steuerung" ("steering" or "control"), which is the deviation of an articulator from its target in one segment due to a different target for the same articulator in a neighboring segment. This distinction not only illuminates aspects of Hewlett and Shockey's experimental material, but also points to a potential criterion for distinguishing (one type of) longer-term coarticulatory effect from more local effects. The /ki ku ti tu/ syllables can be seen to carry both factors: (1) the lip-rounding for /u/ is completely independent of the consonantal tongue articulation for /k/ and /t/, and is therefore "Koartikulation" in the Menzerath and De Lacerda sense; (2) the interaction between the tongue targets for /k/ and the two vowels, and for /t/ and the two vowels is a clear case of "Steuerung." "Steuerung" is likely to be a more local effect, in that the trajectory of a single articulator involved in consecutive "targets" will depend upon the relation between those targets. 139
The "independent" articulator, on the other hand, is free to coarticulate with any segments to which it is redundant (see Benguerel and Cowan 1974; Lubker 1981). Note that Menzerath and De Lacerda's distinction, based on articulatory criteria, encourages different analyses from Hewlett and Shockey's, derived from acoustic measurements. The latter suggest that their more long-term effect stems from differences in jaw height, whereas in the former's system jaw involvement is primarily regarded as a local (Steuerung) effect. Thus Hewlett and Shockey's "local" effect seems to incorporate both Menzerath and De Lacerda's terms, and their own "long-term" effect is not included in the earlier system. The danger of defining coarticulatory types purely in terms of the acoustic-analytic method employed is now clear. The acoustically defined target-locus relationship cannot distinguish between "Koartikulation" and "Steuerung," which are worth separating. The emphasis on the method of acoustic analysis rather than on articulation leads to another design problem, affecting assessment of allophone separation. Allophone separation can only be assessed in the standard way (see Nittrouer et al. 1989) if there are no (or negligible) differences in the primary articulation. In Menzerath and De Lacerda's terminology, we can only assess allophone separation for cases of Koartikulation rather than Steuerung. Allophone separation can thus be assessed in the standard way for /ti/ and /tu/ but not for /ki/ and /ku/. In careful speech, the position of the tongue tip and blade for /t/ is much the same before /i/ as before /u/, but vowel-dependent differences due to the tongue body will have stronger acoustic effects on the burst as the speech becomes faster and less careful. For /ki/ and /ku/, on the other hand, different parts of the tongue body form the /k/ stop closure: the allophones are already separated as [k] and [k], even (judging from the acoustic data) for the speaker with his fronted /u/. Thus, as Hewlett and Shockey hint in their discussion, velar and alveolar stops in these contexts represent very different cases in terms of motor control, and should not be lumped together. We would therefore predict that the difference in the frequency of the peak amplitude of alveolar burst should increase as the /t/ allophones become more distinct, but should decrease for velar bursts as the /k/ allophones become less distinct. This is exactly what Hewlett and Shockey found. Hence we disagree with their claim that allophone-separation measures fail to differentiate citation and reading forms. Seen in this light, citation and reading forms in Hewlett and Shockey's data seem to differ in a consistent way for both the target-locus and the allophone-separation measure. These observations need to be verified statistically: a series of independent T-tests is open to the risk of false significance errors and, more importantly, does not allow us to assess interactions between conditions. The factorial design of the experiment described in 140
chapter 5 is ideally suited to analysis of variance which would allow a more sensitive interpretation of the data. Turning now to the domain of coarticulation, we question whether it is helpful to distinguish between local and long-term effects, and, if it is, whether one can do so in practice. First, we need to say what local and long term mean. One possibility is that local coarticulatory effects operate over a short timescale, while long-term effects span longer periods. Another is that local influences affect only adjacent segments while long-term influences affect nonadjacent segments. The second possibility is closest to traditional definitions of coarticulation in terms of "the overlapping of adjacent articulations" (Ladefoged 1982: 281), or "the influence of one speech segment upon another... of a phonetic context upon a given segment" (Daniloff and Hammarberg 1983: 239). The temporal vs. segmental definitions are not completely independent, and we could also consider a four-way classification: coarticulatory influences on either adjacent or nonadjacent segments, each extending over either long or short durations. Our attempts to fit data from the literature into any of these possible frameworks lead us to conclude that distinguishing coarticulatory influences in terms of local versus long-term effects is unlikely to be satisfactory. There are historical antecedents for a definition of coarticulation in terms of local effects operating over adjacent segments in Kozhevnikov and Chistovich's (1965) notion of "articulatory syllable," and in variants of feature-spreading models (e.g. Henke 1966; Bell-Berti and Harris 1981) with which Hewlett and Shockey's focus on CV structures implicitly conforms. There is also plenty of evidence of local coarticulatory effects between nonadjacent segments. Cases in point are Ohman's (1966a) classic study of vowels influencing each other across intervening stops and Fowler's (1981a) observations on variability in schwa as a function of stressed-vowel context. (In these examples of vowel-to-vowel influences, there are also, of course, coarticulatory effects on the intervening consonant or consonants.) The clarity of this division into adjacent and nonadjacent depends on how one defines "segment." If the definition involved mapping acoustic segments onto phone or phoneme strings, the division could be quite clear, but if the definition is articulatory, which it must be in any production model, then the distinction is not at all clear: units of articulatory control could overlap such that acoustically nonadjacent phonetic-phonological segments influence one another. This point also has implications for experimental design: to assess coarticulatory influences between a single consonant and vowel in connected speech, the context surrounding the critical acoustic segments must either be constant over repetitions, or sufficiently diverse that it can be treated as a random variable. Hewlett and Shockey's passage has uneven distributions of sounds around the critical segments. For example, most of the articulations 141
surrounding /ku/ were dental or alveolar (five before and seven after the syllable, of eight repetitions), whereas less than half of those surrounding /ti/ and /tu/ were dental or alveolar. The above examples are for effects over relatively short periods of time; segmentally induced modifications to features or segments can also extend over longer stretches of time. For example, Kelly and Local (1989) have described the spread of cavity features such as velarity and fronting over the whole syllable and the foot. This spreading affects adjacent (i.e. uninterrupted strings of) segments and is reasonably regarded as segment-induced, since it occurs in the presence of particular sounds (e.g. /r/ or /I/); it is therefore categorizable as coarticulation. Long-term effects on nonadjacent segments have also been observed. Slis (personal communication) and Kohler, van Dommelen, and Timmermann (1981) have found for Dutch and French, respectively, that devoicing of phonologically voiced obstruents is more likely in a sentence containing predominantly voiceless consonants. Such sympathetic voicing or devoicing in an utterance is the result of a property of certain segments spreading to other (nonadjacent) segments. However, classifying these longer-term effects as coarticulation runs the risk that the term becomes a catch-all category, with a corresponding loss in usefulness. In their articulatory (and presumably also acoustic) manifestation there appears to be nothing to differentiate at least these latter cases of long-term effects from the acquired permanent settings of some speakers, and Hewlett and Shockey in fact point out that their longer-term effect (mandible movement) may be best regarded as an articulatory setting. Articulatory setting is probably a useful concept to retain, even though in terms of execution it may not always be very distinct from coarticulation. There are also similarities between the above long-term effects and certain connected-speech processes of particular languages, such as the apparently global tendency in French and Russian for voicing to spread into segments that are phonologically voiceless, which contrasts with German, English, and Swedish, where the opposing tendency of devoicing is stronger. If these general properties of speech are classed as coarticulation, it seems a relatively small step to include umlaut usage in German, vowel harmony, and various other phonological prosodies (e.g. Firth 1948; Lyons 1962) as forms of coarticulation. Many of these processes may have had a coarticulatory basis historically, but there are good grounds in synchronic descriptions for continuing to distinguish aspects of motor control from phonological rules and sociophonetic variables. We do not mean to advocate that the term coarticulation should be restricted to supposed "universal" tendencies and all language-, accent-, or style-dependent variation should be called something else. But we do suggest that there is probably little to be gained by 142
describing all types of variation as coarticulation, unless that word is used as a synonym for speech motor control. We suggest that the type of long-term effect that Hewlett and Shockey have identified in their data is linked with the communicative redundancy of individual segments in continuous speech compared to their communicative value in single-syllable citation. This relationship between communicative context and phonetic differentiation (Kohler 1989; Lindblom 1983) is assumed to regulate the amount of articulatory effort invested in an utterance. Vowel centralization has been shown to disappear in communicative situations where greater effort is required, e.g. noise (Schindler 1975; Summers et al. 1988). Such "communicative sets" would presumably also include phenomena such as extended labiality in "pouted" speech, but not permanent settings that characterize speakers or languages (see the comment on voicing and devoicing tendencies above), though these too are probably better not considered to be "coarticulation." The exclusion of such phenomena from a precise definition of coarticulation does not imply that they have no effect on coarticulatory processes. The "reduction of effort" set, which on the basis of Hewlett and Shockey's data might account for vowel undershoot and lowered release-burst frequencies, can also be invoked to explain indirectly the assimilation of alveolars preceding velars and labials. By weakening in some sense the syllable-final alveolar target, it allows the anticipated velar or labial to dominate acoustically. Of course, this goes no way towards explaining the fact (Gimson 1960; Kohler 1976) that final alveolars are somehow already more unstable in their definition than other places of articulation and therefore susceptible to coarticulatory effects under decreased effort. To conclude, then, we suggest that although there seems to be a great deal of physiological similarity between segment-induced modifications and the settings mentioned above that are linked permanently to speakers or languages, or temporarily to communicative context, it is useful to distinguish them conceptually. This remains true even though a model of motoric execution might treat them all similarly. To constrain the use of the term coarticulation, we need to include the concept of "source segment(s)" and "affected segments" in its definition. The existence of a property or feature extending over a domain of several segments, some of which are not characterized by that feature, does not in itself indicate coarticulation. Our final comment concerns Hewlett and Shockey's suggested acoustic explanation for their finding that the most prominent peaks of the burst spectra were all lower in frequency in the read speech than in the citation utterances. One of their suggested possibilities is that less forceful speech 143
could be associated with a lower volume velocity of flow through the constriction at the release, which would lower the center frequency of the noise-excitation spectrum. The volume velocity of flow might well have been lower in Hewlett and Shockey's connected-speech condition, but we suggest it was probably not responsible for the observed differences. The bandwidth of the noise spectrum is so wide that any change in its center frequency is unlikely to have a significant effect on the output spectrum at the lips. The second suggestion is that a more open jaw could make a larger cavity in front of the closure. It is not clear what is being envisaged here, and it is certainly not simple to examine this claim experimentally. We could measure whether the jaw is more open, but if it were more open, how would this affect the burst spectrum? Hewlett and Shockey's use of the word "size" suggests that they may be thinking of a Helmholtz resonance, but this seems unlikely, given the relationship between the oral-cavity volume and lip opening that would be required. A more general model that reflects the detailed area function (length and shape) of the vocal tract is likely to be required (e.g., Fant 1960). The modeling is not likely to be simple, however, and it is probably inappropriate to attribute the observed pattern to a single cause. The most important section of the vocal tract to consider is the cavity from the major constriction to the lips. It is unclear how front-cavity length and shape changes associated with jaw height would produce the observed patterns. However, if an explanation in terms of cavity size is correct, the most likely explanation we know of is not so much that the front cavity itself is larger, as that the wider lip aperture (that would probably accompany a lowered jaw position) affected the radiation impedance.1 Rough calculations following Flanagan (1982: 36) indicate that changes in lip aperture might affect the radiation impedance by appropriate amounts at the relevant frequencies. Other details should also be taken into consideration, such as the degree to which the cavity immediately behind the constriction is tapered towards the constriction, and the acoustic compliance of the vocal-tract walls. There is probably more tapering in citation than reading form of alveolars and fronted /k/, and the wall compliance is probably less in citation than reading form. Changes in these parameters due to decreased effort could contribute to lowering the frequencies and changing the intensities of vocal-tract resonances. The contribution of each influencing factor is likely to depend on the place of articulation of the stop. To summarize, we feel that in searching for explanations of motor 1
We are indebted to Christine Shadle for pointing this out to us.
behavior from acoustic measurements, it is important to use models (acoustic and articulatory) that can represent differences as well as similarities between superficially similar things. The issues raised by Hewlett and Shockey's study merit further research within a more detailed framework.
Section B Segment
6 An introduction to feature geometry MICHAEL BROE
6.0 Introduction
This paper provides a short introduction to a theory which has in recent years radically transformed the appearance of phonological representations. The theory, following Clements's seminal (1985) paper, has come to be known as feature geometry.1 The rapid and widespread adoption of this theory as the standard mode of representation in mainstream generative phonology stems from two main factors. On the one hand, the theory resolved a debate which had been developing within nonlinear phonology over competing modes of phonological representation (and resolved it to the satisfaction of both sets of protagonists), thus unifying the field at a crucial juncture. But simultaneously, the theory corrected certain long-standing and widely acknowledged deficiencies in the standard version of feature theory, which had remained virtually unchanged since The Sound Pattern of English (Chomsky and Halle 1968; hereafter SPE). The theory of feature geometry is rooted firmly in the tradition of nonlinear phonology - an extension of the principles of autosegmental phonology to the wider phonological domain - and in section 6.1 I review some of this essential background. I then show, in section 6.2, how rival modes of representation developed within this tradition. In section 6.3 I consider a related problem, the question of the proper treatment of assimilation phenomena. These two sections prepare the ground for section 6.4, which
Clements (1985) is the locus classicus for the theory, and the source of the term "feature geometry" itself. Earlier suggestions along similar lines (in unpublished manuscripts) by Mascaro (1983) and Mohanan (1983) are cited by Clements, together with Thrainsson (1978) and certain proposals in Lass (1976), which can be seen as an early adumbration of the leading idea. A more detailed survey of the theory may be found in McCarthy (1988), to which I am indebted. Pulleyblank (1989) provides an excellent introduction to nonlinear phonology in general. 149
shows how feature geometry successfully resolves the representation problem, and at the same time provides a more adequate treatment of assimilation. I then outline the details of the theory, and in section 6.5 show how the theory removes certain other deficiencies in the standard treatment of phonological features. I close with an example of the theory in operation: a treatment of the Sanskrit rule of n-Retroflexion or Nati.

6.1 Nonlinear phonology
Feature geometry can be seen as the latest stage in the extension of principles of autosegmental phonology - originally developed to handle tonal phenomena (Goldsmith 1976) - to the realm of segmental phonology proper. Essential to this extension is the representation of syllabicity on a separate autosegmental tier, the so-called "CV-skeleton" (McCarthy 1979, 1981; Clements and Keyser 1983). Segmental features are then associated with the slots of the CV-tier just as tones and tone-bearing units are associated in autosegmental phonology; and just as in tonal systems, the theory allows for many-to-one and one-to-many associations between the CV-skeleton and the segmental features on the 'melodic' tier. For example, contour segments - affricates, prenasalized stops, short diphthongs, and so on, whose value for some feature changes during the course of articulation - can be given representations analogous to those used for contour tones, with two different values for some feature associated with a single timing slot: (1)
  H   L             [-cont] [+cont]          [+nas] [-nas]
   \ /                   \   /                    \   /
    V                      C                        C
 Contour tone           Affricate          Prenasalized stop
And conversely, geminate consonants and long vowels can be represented as a single quality associated with two CV-positions: (2)
     H                i                 m
    / \              / \               / \
   V   V            V   V             C   C
 Tonal spread     Long vowel    Geminate consonant
As Clements (1985) notes, such a representation gives formal expression to the acknowledged ambiguity of such segments/sequences. Kenstowicz (1970) 150
has observed that, for the most part, rules which treat geminates as atoms are rules affecting segment quality, while rules which require a sequence representation are generally "prosodic," affecting stress, tone, and length itself. Within a nonlinear framework, the distinction can be captured in terms of rule application on the prosodic (quantitative) tier or the melodic (qualitative) tier respectively. Further exceptional properties of geminates also find natural expression in the nonlinear mode of representation. One of the most significant of these is the property of geminate integrity: the fact that, in languages with both geminates and rules of epenthesis, epenthetic segments cannot be inserted "within" a geminate cluster. In nonlinear terms, this is due to the fact that the structural representation of a geminate constitutes a "closed domain"; it is thus impossible to construct a coherent representation of an epenthesized geminate:
(3)
 *   C      V      C
      \      \    /
       \      \  /
        \      \/
         \     /\
          \   /  \
           \ /    \
            m      v
/?akl/ /Pimm/ /jisr kbiir/ /1-walad 1-zyiir/
-> [Pakil] -> [Pimm] -• [jisrikbir] —> [lwaladizzyiir]
food mother bridge big DEF-boy DEF-small
Halle and Vergnaud (1980) suggested that assimilation rules be stated as manipulations of the structural relations between an element on the melodic tier and a slot in the CV-skeleton, rather than as a feature changing operation 151
Segment
on the melodic tier itself. They provide the following formulation of Hausa regressive assimilation: (5)
/littaf+taaf+ai/
c
v
c
-•
c
v
[littattaafai]
c
c
v
v
c
v
v
Here - after the regressive spread of the [t] melody and delinking of the [f] the output of the assimilation process is structurally identical to a geminate consonant, thus accounting for the similarities in their behavior. This in turn opens up the possibility of a generalized account of assimilation phenomena as the autosegmental spreading of a melody to an adjacent CV-slot (see the articles by Hayes, Nolan, and Local this volume). However, a problem immediately arises when we attempt to extend this spreading account to cases of partial assimilation, as in the following example from Hausa (Halle and Vergnaud 1980): (6)
/gaddam + dam + ii/
->
[gaddandamii]
The example shows that, in their words, "it is possible to break the link between a skeleton slot and some features on the melody tier only while leaving links with other features intact" (p. 92): in this case, to delink place features while leaving nasality and sonority intact. Halle and Vergnaud do not extend their formalism to deal with such cases. The problem is simply to provide an account of this partial spreading which is both principled and formally adequate. The radical autosegmentalism of phonological properties gives rise to a general problem of organization. As long as there is just one autosegmental tier - the tonal tier organized in parallel to the basic segmental sequence, the relation between the two is straightforward. But when the segmental material is itself autosegmentalized on a range of tiers - syllabicity, nasality, voicing, continuancy, tone, vowel-harmony features, place features, all of which have been shown to exhibit autosegmental behavior - then it is no longer clear how the tiers are organized with respect to each other. It is important to notice that this issue is not resolved simply by recourse to a "syllabic core" or CVskeleton, although the orientation this provides is essential. As Pulleyblank (1986) puts it: "Are there limitations on the relating of tiers? Could a tone tier, for example, link directly to a nasality tier? Or could a tone tier link directly to a vowel harmony tier?" 152
6 Michael Broe 6.2 Competing representations
Steriade (1982) shows that in Kolami, partially assimilated clusters (obstruent followed by homorganic nasal) resist epenthesis just as geminates do, suggesting that the output of the assimilation rule should indeed exhibit linked structure (in the case of geminates, illicit *CC codas are simplified by consonant deletion, rather than epenthesis): (7)
Present melp-atun idd-atun porjg-atun
Imperative Gloss melep shake id tell porj boil over
Steriade sketches two possible formulations of this process, which came to be known as multiplanar and coplanar respectively (see Archangeli 1985 for more extensive discussion of the two approaches): (8)
Multiplanar place featuresj
Coplanar
place [_features_
manner
manner
featuresj
[features
place featuresj
place featuresj
manner featuresj
manner |_featuresj
i--
These two solutions induce very different types of representation, and make different empirical predictions. Note first that, in the limiting case, a multiplanar approach would produce a representation in which every feature constituted a separate tier. Elements are associated directly with the skeletal core, ranged about it in a so-called "paddlewheel" or "bottlebrush" formation (Pulleyblank 1986, Sagey 1986b). Here, then, the limiting case would exhibit the complete structural autonomy of every feature: [ coronal ]
(9)
[ consonantal ]
[ sonorant ] [ continuant ] 153
Segment
In the coplanar model, on the other hand, tiers are arrayed in parallel, but within the same plane, with certain primary features associated directly with skeletal slots, and secondary or subordinate features associated with a CVposition only indirectly, mediated through association to the primary features. This form of representation, then, exhibits intrinsic featural dependencies. Compare the coplanar representation in (8) above with the following reformulation: (10)
manner |_featuresj
manner |_featuresj
I place I l^featuresj
F place ~| [_featuresj
—1
Here, counterfactually, place features have been represented as primary, associated directly to the skeletal tier, while manner features are secondary, linked only indirectly to a structural position. But under this regime of feature organization it would be impossible to represent place assimilation independently of manner assimilation. Example (10) is properly interpreted as total assimilation: any alteration in the association of the place features is necessarily inherited by features subordinate to them. The coplanar model thus predicts that assimilation phenomena will display implicational asymmetries. If place features can spread independently of manner features, then they must be represented as subordinate to them in the tree; but we then predict that it will be impossible to spread manner features independently. The limiting case here, then, would be the complete structural dependency of every feature. In the case of vowel-harmony features, for example, Archangeli (1985: 370) concludes: "The position we are led to, then, is that Universal Grammar provides for vowel features arrayed in tiers on a single plane in the particular fashion illustrated below." [back] [ round] [high] V 154
6 Michael Broe
Which model of organization is the correct one? It quickly becomes apparent that both models are too strong: neither complete dependence nor complete independence is correct. The theory of feature geometry brings a complementary perspective to this debate, and provides a solution which displays featural dependence in the right measure. 6.3 The internal structure of the segment
To get a sense of this complementary perspective, we break off from the intellectual history of nonlinear phonology for a moment to consider a related question: what kinds of assimilation process are there in the world's languages? The standard theory of generative phonology provides no answer to this question. The fact that an assimilation process affecting the features, {coronal, anterior, back} is expected, while one affecting {continuant, anterior, voice} is unattested, is nowhere expressed by the theory. In SPE, there is complete formal independence between features: a feature matrix is simply an unstructured bundle of properties. The class of assimilation processes predicted is simply any combination of values of any combination of features. The constant recurrence of particular clusters of features in assimilation rules is thus purely accidental. What is required is some formal way of picking out natural classes of features (Clements 1987), a "built-in featural taxonomy" (McCarthy 1988), thus making substantive claims about possible assimilation processes in the world's languages. There is a further problem with the standard account of assimilation, first pointed out by Bach (1968). Consider the following rule: a cor (3 ant y back
(11)
   [α cor ]            [α cor ]
   [β ant ]    /  ___  [β ant ]
   [γ back]            [γ back]
a cor (3 ant y back
P cor y ant a back
/ /
The problem is in fact again due to the complete formal independence of the features involved. This forces us to state assimilation as a set of pairwise agreements, rather than agreement of a set as a whole (Lass 1976: 163). 155
Segment
A possible solution is to introduce an «-ary valued PLACE feature, where the variable [a PLACE] ranges over places of articulation directly. Apart from giving up binarity, the disadvantage of such a solution is that it loses the cross-classificatory characterization of place: retroflex sounds, for example, receiving the description [ret PLACE], would no longer be characterized simultaneously in terms of coronality and anteriority, and a process affecting or conditioned by just one of these properties could not be formulated. Access to a PLACE feature is intuitively desirable, however, and a refinement of this idea is able to meet the above criticisms: we introduce a categoryvalued feature. That is, rather than restricting features to atomic values only (" +," " —," "ret"), we allow a feature matrix itself to constitute the value for some feature. Thus, rather than a representation such as [ret PLACE], we may adopt the following: (13)
+ cor — ant
PLACE
Such a move allows us to preserve thefine-grained,multifeatured characterization of place of articulation; the advantage derives from allowing variables to range over these complex values just as they would over atomic ones. Thus the category (13) above is still matched by the variable [a PLACE]. Category-valued features have been used extensively in syntactic theories (such as lexical-functional grammar [LFG] and generalized phrase-structure grammar [GPSG]) which adopt some form of unification-based representation (Sheiber 1986), and we will adopt such a representation here (see Local, this volume): PLACE =
cor = + ant = —
Here, values have simply been written to the right of their respective features. We may extend the same principle to manner features; and further, group together place and manner features under a superordinate ROOT node, in order to provide a parallel account of total assimilation (that is, in terms of agreement of ROOT specifications): (15)
PLACE =
Fcor = + |_ant = -
ROOT = MANNER =
[ nas
=
—]
156
6 Michael Broe
Such a representation has the appearance of a set of equations between features and their subordinate values, nested in a hierarchical structure. This is in fact more than analogy: any such category constitutes a mathematical function in the strict sense, where the domain is the set of feature names, and the range is the set of feature values. Now just as coordinate geometry provides us with a graphical representation of any function, the better to represent certain salient properties, so we may give a graph-theoretic representation of the feature structure above: (16)
A
ROOT
PLACE
ant
Here, each node represents a feature, and dependent on the feature is its value, be it atomic - as at the leaves of the tree - or category-valued. Formally, then, the unstructured feature matrix of the standard theory has been recast in terms of a hierarchically structured graph, giving direct expression to the built-in feature taxonomy. 6.4 Feature geometry
How is this digression through functions and graphs relevant to the notion of feature geometry? Consider the following question: if (16) above is a segment, what is a string? What is the result of concatenating two such complex objects? A major use of the graph-theoretic modeling of functions in coordinate geometry is to display interactions - points of intersection, for example - between functions. The same holds good for the representations we are considering here. Rather than representing our functions on the x/yaxis of coordinate geometry, however, we arrange them on the temporal axis of feature geometry, adding a third dimension to the representation as in figure 6.1. Such is the mode of representation adopted by feature geometry. Historically, the model developed not as a projection of the graph-theoretic model of the segment, but through an elaboration of the theory of planar organization informed by the concerns of featural taxonomy.2 2
As is usual in formal linguistics, various formalizations of the same leading idea, each with subtly different ramifications, are on offer. The fundamental literature (Clements, 1985; Sagey 1986a) is couched in terms of planar geometry; Local (this volume) adopts a rigorous graphtheoretic approach; while work by Hammond (1988), Sagey (1988), and Bird and Klein (1990) is expressed in terms of algebras of properties and relations. 157
Segment ROOT MANNER
nasal
Figure 6.1 Geometrical representation of two successive segments in a phonological string. When the figure is viewed "end on," the structure of each segment appears as a tree like that in example (16). The horizontal dimension represents time
MANNER PLACE
Figure 6.2 The representation of autosegmental assimilation in a geometrical structure of the sort shown in figure 6.1
The essential insight of feature geometry is to see that, in such a representation, "nodes of the same class or feature category are ordered under the relation of concatenation and define a TIER" (Clements 1985: 248). With respect to such tiers, the familiar processes of nonlinear phonology such as spreading, delinking, and so on, may be formulated in the usual way. Such a representation thus provides a new basis for autosegmental organization. The rule of place assimilation illustrated in (8) above, for example, receives the formulation in figure 6.2. Thus whole clusters of properties may be spread at a single stroke, and yet may do so quite independently of features which are represented on a different tier, in some other subfield of the taxonomy. Clements (1985) helpfully describes these elaborate representations in the following manner: "This conception resembles a construction of cut and glued paper, such that each fold is a class tier, the lower edges are feature tiers, and the upper edge is the CV tier" (p. 229). Feature geometry solves the organization problem mentioned above by adopting a representation in which autosegmental tiers are arranged in a hierarchical structure based on a featural taxonomy. The theory expresses the intimate connection between 158
6 Michael Broe [ROOT
SUPRALARYNGEAL
constricted spread glottis glottis
stiff v.c. RADICAL
round anterior distributed high low back ATR
Figure 6.3 An overview of the hierarchical organization of phonological features, based on Sagey 1986 and others
the internal organization of the segment and the kinds of phonological processes such segments support. As Mohanan (1983) and Clements (1985) point out, the hierarchical organization of autosegmental tiers, combined with a spreading account of assimilation, immediately predicts the existence of three common types of assimilation process in the world's languages: total assimilation processes, in which the spreading element is a root node; partial assimilation processes, in which the spreading element is a class node; and single-feature assimilation, in which a single feature is spread. Clements (1985: 231f.) provides exemplification of this three-way typology of assimilation processes. More complex kinds of assimilation can still be stated, but at greater cost. While the details of the geometry are still under active development, a gross architecture has recently received some consensus. A recurrent shaping idea is that many features are "articulator-bound" (Ladefoged and Halle 1988) and that these should be grouped according to the articulator which executes them: larynx, soft palate, lips, tongue blade, tongue body, tongue root (see Browman and Goldstein, this volume). The features [high], [low], and [back], for example, are gestures of the tongue body, and are grouped together under the DORSAL node. Controversy still surrounds features which are "articulator free" - such as stricture features - and the more abstract major class features [sonorant] and [consonantal]; we will have little to say about such features here (see McCarthy 1988 for discussion). In addition, the geometry reflects the proclivity of features associated with certain articulators to pattern together in assimilatory processes, gathering them under the SUPRALARYNGEAL and PLACE nodes. The picture that emerges 159
Segment
looks like figure 6.3. At the highest level of branching, the framework is able to express Lass's (1976: 152f.) suggestion that feature matrices can be internally structured into laryngeal and supralaryngeal gestures: in certain ways [?] is similar to (the closure phase of) a postvocalic voiceless stop. There is a complete cut-off of airflow through the vocal tract: i.e. a configuration that can reasonably be called voiceless, consonantal (if we rescind the ad hoc restriction against glottal strictures being so called), and certainly noncontinuant. In other words, something very like the features of a voiceless stop, but MINUS SUPRALARYNGEAL ARTICULATION . . . Thus [?] and [h] are DEFECTIVE . . . they are missing an
entire component or parameter that is present in "normal" segments. Moreover, Lass's suggested formalization of this insight bears a striking resemblance to the approach we have sketched above: Let us represent any phonological segment, not as a single matrix, but as containing two submatrices, each specifying one of the two basic parameters: the laryngeal gesture and the oral or supralaryngeal gesture. The general format will be:
   [  [oral]
      [laryngeal]  ]
Lass cites alternations such as the following from Scots: (17)
  kaʔ    cap      ka:rʔredʒ   cartridge     oʔm     open
  baʔ    bat      wenʔʌr      winter        bʌʔn    button
  baʔ    back     fɪlʔʌr      filter        broʔn   broken
and notes that the rule neutralizing [p t k] to [?] can now be formulated simply as deletion of the supralaryngeal gesture: in our terms, delinking of the SUPRALARYNGEAL node. The LARYNGEAL node, then, dominates a set of features concerning the state of the glottis and vocal cords in the various phonation-types. The organization of features below the PLACE node has been greatly influenced by Sagey's (1986a) thesis. This work is largely concerned with complex segments - labiovelars, labiocoronals, clicks, and so on, sounds with more than one place of articulation - and the development of a model which allows the expression of just those combinations which occur in human languages. (These are to be distinguished from the contour segments mentioned above. The articulations within a complex segment are simultaneous at least as far as the phonology is concerned.) It is to be noted, then, that a simple segment will be characterized by just one of the PLACE articulators, while a complex segment will be characterized by more than one - LABIAL and DORSAL, say.
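Lass's reanalysis of Scots glottalling can be put in the same toy terms: deleting (delinking) the SUPRALARYNGEAL node of a voiceless stop leaves only its laryngeal specification, i.e. the "defective" segment [ʔ]. The fragment of figure 6.3 used below, and the particular feature values in it, are simplifications and assumptions of ours, not the chapter's.

# Toy sketch (ours) of (17): [p t k] -> glottal stop as deletion of the
# SUPRALARYNGEAL class node, leaving only the laryngeal gesture behind.
import copy

def voiceless_stop(articulator):
    # A much-simplified fragment of the geometry in figure 6.3;
    # the laryngeal values are placeholders for a plain voiceless stop.
    return {
        "ROOT": {
            "LARYNGEAL": {"spread glottis": "-", "constricted glottis": "-"},
            "SUPRALARYNGEAL": {"PLACE": {articulator: "+"}},
        }
    }

def delink_supralaryngeal(segment):
    out = copy.deepcopy(segment)
    out["ROOT"].pop("SUPRALARYNGEAL", None)   # delete the supralaryngeal gesture
    return out

t = voiceless_stop("CORONAL")
print(delink_supralaryngeal(t))
# {'ROOT': {'LARYNGEAL': {'spread glottis': '-', 'constricted glottis': '-'}}}
# Only the laryngeal specification survives: Lass's "defective" segment.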
6.5 More problems with standard feature theory
The hierarchical view of segment structure also suggests a solution to certain long-standing problems in standard feature theory. Consider the following places of articulation, together with their standard classification:³

(18)
        labial   alveolar   palatal   retroflex   velar   uvular
        -cor     +cor       +cor      +cor        -cor    -cor
        +ant     +ant       -ant      -ant        -ant    -ant
Such an analysis predicts the following natural classes:

(19)   [+cor]: {alveolar, palatal, retroflex}
       [-cor]: *{labial, velar, uvular}
But while the [+cor] class is frequently attested in phonological rules, the [-cor] class is never found. The problem here is that standard feature theory embodies an implicit claim that, if one value of a feature denotes a natural class, then so will the opposite value. This is hard-wired into the theory: it is impossible to give oneself the ability to say [+F] without simultaneously giving oneself the ability to say [-F]. Consider now a classification based on active articulators:

(20)
        labial   alveolar   palatal   retroflex   velar   uvular
        LAB      COR        COR       COR         DOR     DOR
Such a theory predicts the following classes: (21)
       [LABIAL]: {labial}
       [CORONAL]: {alveolar, palatal, retroflex}
       [DORSAL]: {velar, uvular}
Under this approach, the problematic class mentioned above simply cannot be mentioned - the desired result. Consider now the same argument with respect to the feature [anterior]; we predict the following classes: (22)
       [+ant]: *{labial, alveolar}
       [-ant]: *{palatal, retroflex, velar, uvular}
Here the problem is even worse. As commonly remarked in the literature, there seems to be no phonological process for which this feature denotes a natural class. [anterior] is effectively an ancillary feature, whose function is the subclassification of [+coronal] segments, not the cross-classification of the entire consonant inventory. We can express this ancillary status in a hierarchical model by making the feature [anterior] subordinate to the CORONAL node:

(23)
        labial   alveolar   palatal   retroflex   velar   uvular
        LAB      COR        COR       COR         DOR     DOR
                 |          |         |
                 +ant       -ant      -ant

³ The following discussion is based on Yip (1989).
Similarly, the feature [distributed] is also represented as a dependent of the coronal node, distinguishing retroflex sounds.

6.6 An example
This organization of the PLACE articulators can be seen in action in an account of Sanskrit n-Retroflexion or Ṇati (Schein and Steriade 1986). The rule, simplified, may be quoted direct from Whitney:

189. The dental nasal n . . . is turned into the lingual [i.e. retroflex] ṇ if preceded in the same word by . . . ṣ or r: and this, not only if the altering letter stands immediately before the nasal, but at whatever distance from the latter it may be found: unless, indeed, there intervene (a consonant moving the front of the tongue: namely) a palatal . . . , a lingual or a dental. (Whitney 1889: 64)

Note Whitney's attention to the active articulator in his formulation of the rule: "a consonant moving the front of the tongue." The rule is exemplified in the data in table 6.1 (Schein and Steriade 1986). We may represent this consonant harmony as the autosegmental spreading of the CORONAL node, targeting a coronal nasal, as shown in figure 6.4.
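Stated procedurally (the transliteration, the segment classes, and the helper below are illustrative assumptions, not Schein and Steriade's formalization), the rule amounts to a left-to-right scan in which a trigger opens a retroflexing span that only another front-of-the-tongue consonant can close:

```python
# Illustrative sketch of n-Retroflexion as a left-to-right scan over a word.
# "s." and "n." stand in for the retroflexes ṣ and ṇ; the trigger and blocker
# sets below are assumed simplifications of Whitney's statement.  Vowels,
# labials, and velars fall in neither set and so are transparent.
TRIGGERS = {"s.", "r"}                                      # ṣ or r earlier in the word
BLOCKERS = {"c", "j", "s", "t", "d", "n", "t.", "d.", "l"}  # palatal, lingual, or dental consonants

def nati(segments):
    out, active = list(segments), False
    for i, seg in enumerate(out):
        if seg in TRIGGERS:
            active = True                    # a trigger opens the span
        elif seg == "n" and active:
            out[i] = "n."                    # the dental nasal is retroflexed
        elif seg in BLOCKERS:
            active = False                   # a front-of-the-tongue consonant closes it
    return out

print(nati(["i", "s.", "n", "aa"]))          # ṇ surfaces, cf. iṣ-ṇaa- 'seek'
print(nati(["m", "r", "d", "n", "aa"]))      # n unchanged: the coronal d intervenes, cf. mṛd-naa-
```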
Figure 6.4 Sanskrit n-Retroflexion (Ṇati) expressed as spreading of coronal node
Table 6.1 Data exemplifying the Sanskrit n-Retroflexion rule shown in figure 6.4

                          Base form        Nati form        Gloss
Present                1  mṛd-naa-                          be gracious
                       2                   iṣ-ṇaa-          seek
                       3                   pṛ-ṇaa-          fill
Passive                4  bhug-na-                          bend
                       5                   puur-ṇa-         fill
                       6                   vṛk-ṇa-          cut up
Middle participle 1    7  marj-aana-                        wipe
                       8  kṣved-aana-                       hum
                       9                   pur-aaṇa-        fill
                      10                   kṣubh-aaṇa-      quake
                      11                   cakṣ-aaṇa-       see
Middle participle 2   12  kṛt-a-maana-                      cut
                      13                   kṛp-a-maaṇa-     lament
If a labial, velar, or vowel intervenes - segments characterized by LABIAL and DORSAL articulators - no ill-formedness results, and the rule is free to apply. If, however, a coronal intervenes, the resultant structure will be ruled out by the standard no-crossing constraint of autosegmental phonology. The featural organization thus explains why non-coronals are transparent to the coronal harmony (see figure 6.5). This accords nicely with Whitney's account:

We may thus figure to ourselves the rationale of the process: in the marked proclivity of the language toward lingual utterance, especially of the nasal, the tip of the tongue, when once reverted into the loose lingual position by the utterance of non-contact lingual element, tends to hang there and make its next nasal contact in that position: and does so, unless the proclivity is satisfied by the utterance of a lingual mute, or the organ is thrown out of adjustment by the utterance of an element which causes it to assume a different posture. This is not the case with the gutturals or labials, which do not move the front part of the tongue. (Whitney 1879: 65)

Figure 6.5 The operation of the rule shown in figure 6.4, illustrating the transparency of an intervening labial node

Note that, with respect to the relevant (CORONAL) tier, target and trigger are
adjacent. This gives rise to the notion that all harmony rules are "local" in an extended sense: adjacent at some level of representation. It may be helpful to conclude with an analogy from another cognitive faculty, one with a long and sophisticated history of notation: music. The following representation is a perceptually accurate transcription of a piece of musical data: (24)
In this transcription, x and y are clearly nonadjacent. But now consider this performance score, where the bass line is represented "autosegmentalized" on a separate tier: (25)
Here, x and y are adjacent, on the relevant tier. Note, too, that there is an articulatory basis to this representation, being a transcription of the gestures of the right and left hands respectively: x and y are adjacent in the left hand. An articulator-bound notion of feature geometry thus lends itself naturally to a conception of phonological representation as gestural score (Browman and Goldstein 1989 and this volume).
The most notable achievement of feature geometry, then, is the synthesis it achieves between a theory of feature classification and taxonomy, on the one hand, and the theory of autosegmental representation - in its extended application to segmental material - on the other. Now as Chomsky (1965: 172) points out, the notion of the feature matrix as originally conceived gives direct expression to the notion of "paradigm" - the system of oppositions operating at a given place in structure: "The system of paradigms is simply described as a system of features, one (or perhaps some hierarchic configuration) corresponding to each of the dimensions that define the system of paradigms." Feature geometry makes substantive proposals regarding the "hierarchic configuration" of the paradigmatic dimension in phonology. But it goes further, and shows what kind of syntagmatic structure such a hierarchy supports. A new syntagm, a new paradigm.
7 The segment: primitive or derived?
JOHN J. OHALA
7.1 Introduction
The segmental or articulated character of speech has been one of the cornerstones of phonology since its beginnings some two-and-a-half millennia ago.* Even though segments were broken down into component features, the temporal coordination of these features was still regarded as a given. Other common characteristics of the segment, not always made explicit, are that they have a roughly steady-state character (or that most of them do), and that they are created out of the same relatively small set of features used in various combinations. Autosegmental phonology deviates somewhat from this by positing an underlying representation of speech which includes autonomous features (autosegments) uncoordinated with respect to each other or to a CV core or "skeleton" which is characterized as "timing units."1 These autonomous features can undergo a variety of phonological processes on their own. Ultimately, of course, the various features become associated with given Cs or Vs in the CV skeleton. These associations or linkages are supposed to be governed by general principles, e.g. left-to-right mapping (Goldsmith 1976), the obligatory contour principle (Leben 1978), the shared feature convention (Steriade 1982). These principles of association are "general" in the sense *I thank Bjorn Lindblom, Nick Clements, Larry Hyman, John Local, Maria-Josep Sole, and an anonymous reviewer for helpful comments on earlier versions of this paper. The program which computed the formant frequencies of the vocal-tract shapes in figure 7.2 was written by Ray Weitzman, based on earlier programs constructed by Lloyd Rice and Peter Ladefoged. A grant from the University of California Committee on Research enabled me to attend and present this paper in Edinburgh. 1 As far as I have been able to tell, the terms "timing unit" or "timing slot" are just arbitrary labels. There is no justification to impute a temporal character to these entities. Rather, they are just "place holders" for the site of linkage that the autosegments eventually receive. 166
that they do not take into account the "intrinsic content" of the features (Chomsky and Halle 1968: 400ff.); the linkage would be the same whether the autosegments were [± nasal] or [ + strident]. Thus autosegmental phonology preserves something of the traditional notion of segment in the CV-tier but this (auto)segment at the underlying level is no longer determined by the temporal coordination of various features. Rather, it is an abstract entity (except insofar as it is predestined to receive linkages with features proper to vowels or consonants). Is the primitive or even half-primitive nature of the segment justified or necessary? I suggest that the answer to this question is paradoxically both "no" and "yes": "no" from an evolutionary point of view, but "yes" in every case after speech became fully developed; this latter naturally includes the mental grammars of all current speakers. I will argue that it is impossible to have articulated speech, i.e., with "segments," without having linked, i.e. temporally coordinated, features. However, it will be necessary to justify separately the temporal linkage of features, the existence of steady-states, and the use of a small set of basic features; it will turn out that these characteristics do not all occur in precisely the same temporal domain or "chunk" in the stream of speech. Thus the "segment" derived will not correspond in all points to the traditional notion of segment. For the evolutionary part of my story I am only able to offer arguments based primarily on the plausibility of the expected outcome of a "gedanken" simulation; an actual simulation of the evolution of speech using computer models has not been done yet.2 However, Lindblom (1984, 1989) has simulated and explored in detail some aspects of the scenario presented here. Also relevant is a comparison of speech-sound sequences done by Kawasaki (1982) and summarized by Ohala and Kawasaki (1984). These will be discussed below. In any case, much of my argument consists of bringing wellknown phonetic principles to bear on the issue of how speech sounds can be made different from each other - the essential function of the speech code. 7.2 Evolutionary development of the segment
7.2.1 Initial conditions
Imagine a prespeech state in which all that existed was the vocal tract and the ear (including their neurological and neuromuscular underpinnings). The vocal tract and the ear would have the same physical and psychophysical constraints that they have now (and which presumably can be attributed to
² A computational implementation of the scenario described was attempted by Michelle Caisse but was not successful due to the enormity of the computations required and the need to define more carefully the concept of "segmentality," the expected outcome.
natural physical and physiological principles and the constraints of the ecological niche occupied by humans). We then assign the vocal tract the task of creating a vocabulary of a few hundred different utterances (words) which have the following properties:

1 They must be inherently robust acoustically, that is, easily differentiated from the acoustic background and also sufficiently different from each other. I will refer to both these properties as "distinctness". Usable measures of acoustic distinctness exist which are applicable to all-voiced speech with no discontinuities in its formant tracks; these have been applied to tasks comparable to that specified here (Kawasaki 1982; Lindblom 1984; Ohala et al. 1984). Of course, speech involves acoustic modulations in more than just spectral pattern; there are also modulations in amplitude, degree of periodicity, rate of periodicity (fundamental frequency), and perhaps other parameters that characterize voice quality. Ultimately all such modulations have to be taken into account.

2 Errors in reception are inevitable, so it would be desirable to have some means of error correction or error reduction incorporated into the code.

3 The rate and magnitude of movements of the vocal tract must operate within its own physical constraints and within the constraints of the ear to detect acoustic modulations. What I have in mind here is, first, the observation that the speech organs, although having no constraint on how slowly they can move, definitely have a constraint on how rapidly they can move. Furthermore, as with any muscular system, there is a trade-off between amplitude of movement and the speed of movement; the movements of speech typically seem to operate at a speed faster than that which would permit maximal amplitude of movement but much slower than the maximal rate of movement (Ohala 1981b, 1989). (See McCroskey 1957; Lindblom 1983; Lindblom and Lubker 1985 on energy expenditure during speech.) On the auditory side, there are limits to the magnitude of an optimal acoustic modulation, i.e., any change in a sound. Thus, as we know from numerous psychophysical studies, very slow changes are hardly noticeable and very rapid changes present a largely indistinguishable "blur" to the ear. There is some optimal range of rates of change in between these extremes (see Licklider and Miller 1951; Bertsch et al. 1956). Similar constraints govern the rate of modulations detectable by other sense modalities and show up in, e.g., the use of flashing lights to attract attention.

4 The words should be as short as possible (and we might also establish an upper limit on the length of a word, say, 1 sec). This is designed to prevent a vocabulary where one word is /ba/, another /baba/, another /bəbəbə/ etc.,
with the longest word consisting of a sequence of n /b9/s where n = the size of the vocabulary. 7.2.2 Anticipated results 7.2.2.1 Initially: random constrictions and expansions What will happen when such a system initially sets out to make the required vocabulary? Notice that no mention has been made of segments, syllables or any other units aside from "word" which in this case is simply whatever happens between silences. One might imagine that it would start by making sequences of constrictions and expansions randomly positioned along the vocal tract, which sequences had some initially chosen arbitrary time duration. At the end of this exercise it would apply its measure of accoustic robustness and distinctness and, if the result was unsatisfactory, proceed to create a second candidate vocabulary, now trying a different time duration and a different, possibly less random, sequence of constrictions and expansions and again apply its acoustic metric, and so on until the desired result was achieved. Realistically the evaluation of a vocabulary occurs when nonrobust modulations are weeded out because listeners confuse them, do not hear them, etc., and replace them by others. Something of this sort may be seen in the loss of [w] before back rounded vowels in the history of the pronunciations of English words sword (now [sojd]), swoon (Middle English sun), and ooze (from Old English wos) (Dobson 1968: 979ff.). The acoustic modulation created when going from [w] to back rounded vowel is particularly weak in that it involves little change in acoustic parameters (Kawasaki 1982). My prediction is that this system would "discover" that segments were necessary to its task, i.e. that segments, far from being primitives, would "fall out" as a matter of course, given the initial conditions and the task constraints. My reasons for this speculation are as follows. 7.2.2.2 Temporal coordination The first property that would evolve is the necessity for temporal coordination between different articulators. A single articulatory gesture (constriction or expansion) would create an acoustic modulation of a certain magnitude, but the system would find that by coordinating two or more such gestures it could create modulations that were greater in magnitude and thus more distinct from other gestures. A simple simulation will demonstrate this. Figure 7.1 shows a possible vowel space defined by the frequencies of the first two formants. To relate this figure to the traditonal vowel space, the peripheral vowels from the adult male vowels reported by Peterson and Barney (1952) are given as filled 169
Figure 7.1 Vowel space with five hypothetical vowels corresponding to the vocal-tract configurations shown in figure 7.2. Abscissa: Formant 1; ordinate: Formant 2. For reference, the average peripheral vowels produced by adult male speakers, as reported by Peterson and Barney (1952) is shown by filled squares connected by solid lines
squares connected by solid lines; hypothetical vowels produced by the shapes given in figure 7.2 are shown as filled circles. Note that the origin is in the lower left corner, thus placing high back vowels in the lower left, high front vowels in the upper left, and low vowels on the far right. Point 1 marks the formant frequencies produced by a uniform vocal tract of 17 cm length, i.e. with equal cross-dimensional area from glottis to lips. Such a tract is represented schematically as " 1 " in figure 7.2. A constriction at the lips (schematically represented as "2" in figure 7.2) would yield the vowel labelled 2. If vowel 1 is the "neutral" central vowel, then vowel 2 is somewhat higher and more back. Now if, simultaneous with the constriction in vowel 2, a second constriction were made one-third of the way up from the glottis, approximately in the uvular or upper pharyngeal region (shown schematically as " 3 " in figure 7.2) vowel 3 would result. This is considerably more back and higher than vowel 2. As is well known (Chiba and Kajiyama 1941; Fant 1960) we get this effect by placing constrictions at both of the nodes in the pressure standing wave (or equivalently, the antinodes in the velocity standing wave) for the second resonance of the vocal tract. (See also Ohala and Lorentz 1977; Ohala 1979a, 1985b.) Consider another case. Vowel 4 results when a constriction is made in the palatal region. With respect to vowel 1, it is somewhat higher. But if a pharyngeal expansion is combined with the palatal constriction, as in vowel 5, we get a vowel that is, in fact, maximally front and high. 170
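The values for the uniform tract (point 1) can be checked against the familiar quarter-wavelength approximation for a tube closed at the glottis and open at the lips; the formula and the numerical substitution below are supplied here for illustration and are not quoted from the chapter.

```latex
% Resonances of a uniform tube of length L closed at one end (the glottis):
\[
  F_n \;=\; \frac{(2n-1)\,c}{4L}, \qquad c \approx 35{,}000\ \mathrm{cm/s},\; L = 17\ \mathrm{cm}
  \;\;\Rightarrow\;\; F_1 \approx 515\ \mathrm{Hz}, \quad F_2 \approx 1544\ \mathrm{Hz},
\]
% values consistent with the "neutral" central vowel plotted as point 1.
```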
Figure 7.2 Five hypothetical vocal-tract shapes corresponding to the formant frequency positions in figure 7.1. Vertical axis: vocal-tract cross-dimensional area; horizontal axis: vocal-tract length from glottis (right) to lips (left). See text for further explanation
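The "measure of acoustic robustness and distinctness" that the imagined system applies to candidate vocabularies can be pictured, in its crudest form, as the smallest pairwise separation among the items in formant space. The sketch below assumes plain Euclidean distance over (F1, F2) values in hertz and invented formant values purely for illustration; the studies cited above (Kawasaki 1982; Lindblom 1984; Ohala et al. 1984) use considerably more refined auditory metrics.

```python
# Crude illustration of a distinctness measure: the worst-case (smallest)
# pairwise distance among candidate vowels in (F1, F2) space.  Euclidean
# distance in Hz and the example values are assumptions made for simplicity.
from itertools import combinations
from math import dist

def min_separation(vowels):
    """vowels: mapping from a label to an (F1, F2) pair in Hz."""
    return min(dist(a, b) for a, b in combinations(vowels.values(), 2))

hypothetical = {"neutral": (500, 1500), "u-like": (300, 700), "i-like": (280, 2250)}
print(round(min_separation(hypothetical)))   # a vocabulary is better the larger this is
```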
This system, I maintain, will discover that coordinated articulations are necessary in order to accomplish the task of making a vocabulary consisting of acoustically distinct modulations. This was illustrated with vowel articulations, but the same principle would apply even more obviously in the case of consonantal articulations. In general, manner distinctions (the most robust cues for which are modulations of the amplitude envelope of the speech signal) can reach extremes only by coordinating different articulators. Minimal amplitude during an oral constriction requires not only abduction of the vocal cords but also a firm seal at the velopharyngeal port. Similar arguments can be made for other classes of speech sounds. In fact, there is a growing body of evidence that what might seem like quite distant and anatomically unlinked articulatory events actually work together, presumably in order to create an optimal acoustic-auditory signal. For example, Riordan (1977) discovered interactions between lip rounding and larynx height, especially for rounded vowels. Sawashima and Hirose (1983) have discovered different glottal states for different manners of articulation: a voiceless fricative apparently has a wider glottis than a comparable voiceless stop does. Lofqvist et al. (1989) find evidence of differing tension - not simply the degree of abduction - in the vocal cords between voiced and voiceless obstruents. It is well known, too, that voiceless obstruents have a greater closure duration than cognate voiced obstruents 171
(Lehiste 1970: 28); thus there is interaction between glottal state and the overall consonantal duration. The American English vowel [a*], which is characterized by the lowest third formant of any human vowel, has three constrictions: labial, mid-palatal, and pharyngeal (Uldall 1958; Delattre 1971; Ohala 1985b). These three locations are precisely the locations of the three antinodes of the third standing wave (the third resonance) of the vocal tract. In many languages the elevation of the soft palate in vowels is correlated with vowel height or, what is probably more to the point, inversely correlated with the first formant of the vowel (Lubker 1968; Fritzell 1969; Ohala 1975). There is much cross-linguistic evidence that [u]-like vowels are characterized not only by the obvious lip protrusion but also by a lowered larynx (vis-a-vis the larynx position for a low vowel like [a]) (Ohala and Eukel 1987). Presumably, this lengthening of the vocal tract helps to keep the vowel resonances as low as possible and thus maximally distinct from other vowels. As alluded to above, it is well known in sensory physiology that modulations of stimulus parameters elicit maximum response from the sensory receptor systems only if they occur at some optimal rate (in time or space, depending on the sense involved). A good prima facie case can be made that the speech events which come closest to satisfying this requirement for the auditory system are what are known as "transitions" or the boundaries between traditional segments, e.g. bursts, rapid changes in formants and amplitude, changes from silence to sound or from periodic to aperiodic excitation and vice versa. So all that has been argued for so far is that temporally coordinated gestures would evolve - including, perhaps, some acoustic events consisting of continuous trajectories through the vowel space, clockwise and counterclockwise loops, S-shaped loops, etc. These may not fully satisfy all of our requirements for the notion of "segment," so other factors, discussed below, must also come into play, 7.2.2.3 "Steady-state" Regarding steady-state segments, several things need to be said. First of all, from an articulatory point of view there are few if any true steady-state postures adopted by the speech organs. However, due to the nonlinear mapping from articulation to aerodynamics and to acoustics there do exist near steady-states in these latter domains.3 In most cases the reason for this 3
³ Thus the claim, often encountered, that the speech signal is continuous, that is, shows few discontinuities and nothing approximating steady-states in between (e.g. Schane 1973: 3; Hyman 1975: 3), is exaggerated and misleading. The claim is largely true in the articulatory domain (though not in the aerodynamic domain). And it is true that in the perceptual domain the cues for separate segments or "phonemes" may overlap, but this by itself does not mean that the perceptual signal has no discontinuities. The claim is patently false in the acoustic domain as even a casual examination of spectrograms of speech reveals.
nonlinear relationship is not difficult to understand. Given the elasticity of the tissue and the inertia of the articulators, during a consonantal closing gesture the articulators continue to move even after complete closure is attained. Nevertheless, for as long as the complete closure lasts it effectively attenuates the output sound in a uniform way. Other parts of the vocal tract can be moving and still there will be little or no acoustic output to reveal it. Other nonlinearities govern the creation of steady-states or near-steadystates for other types of speech events (Stevens 1972, 1989). But there may be another reason why steady-states would be included in the speech signal. Recall the task constraint that the code should include some means for error correction or error reduction. Benoit Mandelbrot (1954) has argued persuasively that any coded transmission subject to errors could effect error reduction or at least error limitation by having "breakpoints" in the transmission. Consider the consequences of the alternative, where everything transmitted in between silence constituted the individual cipher. An error affecting any part of that transmission would make the entire transmission erroneous. Imagine, for example, a Morse-code type of system which for each of the 16 million possible sentences that could be conveyed had a unique string of twenty-four dots and dashes. An error on even one of the dots and dashes would make the whole transmission fail. On the other hand if the transmission had breakpoints often enough, that is, places where what had been transmsitted so far could be decoded, then any error could be limited to that portion and it would not nullify the whole of the transmission. Checksums and other devices in digital communications are examples of this strategy. I think the steady-states that we find in speech, from 50 to 200 msec, or so in duration, constitute the necessary "dead" intervals or breakpoints that clearly demarcate the chunks with high information density. During these dead intervals the listener can decode these chunks and then get ready for the subsequent chunks. What I am calling "dead" intervals are, of course, not truly devoid of information but I would maintain that they transmit information at a demonstrably lower rate than that during the rapid acoustic modulations they separate. This, in fact, is the interpretation I give to the experimental results of Ohman (1966b) and Strange, Verbrugge, and Edman (1976). It must be pointed out that if there is a high amount of redundancy in the code, which is certainly true of any human language's vocabulary, then the ability to localize an error of transmission allows error correction, too. Hearing "skrawberry" and knowing that there is no such word while there is a word strawberry allows us to correct a (probable) transmission error. I believe that these chunks or bursts of high-density information flow are what we call "transitions" between phonemes. I would maintain that these are the kind of units required by the constraints of the communication task. 173
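Mandelbrot's point can be made concrete with a toy calculation; the block size and the parity check below are illustrative assumptions, not a model of speech. If the whole transmission is one undivided cipher, a single corrupted symbol invalidates everything; with independently checkable blocks, the damage stays local and a redundant vocabulary can often repair it (as with "skrawberry").

```python
# Toy illustration of error limitation through breakpoints: the signal is cut
# into blocks, each carrying a parity "checksum", so one corrupted symbol
# invalidates only its own block rather than the whole transmission.
def encode(bits, block=4):
    chunks = [bits[i:i + block] for i in range(0, len(bits), block)]
    return [(c, sum(c) % 2) for c in chunks]            # (payload, parity)

def damaged(coded):
    return [i for i, (c, p) in enumerate(coded) if sum(c) % 2 != p]

coded = encode([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0])
coded[1] = ([0, 0, 1, 1], coded[1][1])                  # flip one bit in the second block
print(damaged(coded))                                   # -> [1]: the error is localized
```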
These are what the speaker is intending to produce when coordinating the movements of diverse articulators4 and these are what the listener attends to. Nevertheless, these are not equivalent to our traditional conception of the "segment." The units arrived at up to this point contain information on a sequential pair of traditional segments. Furthermore, the inventory of such units is larger than the inventory of traditional segments by an order of magnitude. Finally, what I have called the "dead interval" between these units is equivalent to the traditional segment (the precise boundaries may be somewhat ambiguous but that, in fact, corresponds to reality). I think that our traditional conception of the segment arises from the fact that adjacent pairs of novel segments, i.e. transitions, are generally correlated. For example, the transition found in the sequence /ab/ is almost invariably followed by one of a restricted set of transitions, those characteristic of/bi/, /be/, /bu/, etc., but not /gi/, /de/. As it happens, this correlation between adjacent pairs of transitions arises because it is not so easy for our vocal tract to produce uncorrelated transitions: the articulator that makes a closure is usually the same one that breaks the closure. The traditional segment, then, is an entity constructed by speakers-listeners; it has a psychological reality based on the correlations that necessarily occur between successive pairs of the units that emerge from the underlying articulatory constraints. The relationship between the acoustic signal, the transitions which require the close temporal coordination between articulators, and the traditional segments is represented schematically in figure 7.3. 7.2.2.4 Features
If an acoustically salient gesture is "discovered" by combining labial closure, velic elevation, and glottal abduction, will the same velic elevation and glottal abduction be "discovered" to work well with apical and dorsal closures? Plausibly, the system should also be able to discover how to "recycle" features, especially in the case of modulations made distinct by the combination of different "valves" in the vocal tract. There are, after all, very few options in this respect: glottis, velum, lips, and various actions of the tongue (see also Fujimura 1989b). A further limitation exists in the options available for modulating and controlling spectral pattern by virtue of the fact that the standing wave patterns of the lowest resonances have nodes and 4
⁴ The gestures which produce these acoustic modulations may require not only temporal coordination between articulators but also precision in the articulatory movements themselves. This may correspond to what Fujimura (1986) calls "icebergs": patterns of temporally localized invariant articulatory gestures separated by periods where the gestures are more variable.
Figure 7.3 Relationship between acoustic speech signal (a), the units with high-rate-ofinformation transmission that require close temporal coordination between articulators (b), and the traditional segment (c)
antinodes at discrete and relatively few locations in the vocal tract (Chiba and Kajiyama 1941; Fant 1960; Stevens 1972, 1989): an expansion of the pharynx would serve to keep F} as low as possible when accompanying a palatal constriction (for an [i]) as well as when accompanying simultaneous labial and uvular constrictions (for an [u]) due to the presence there of an antinode in the pressure standing wave of the lowest resonance.5 Having said this, however, it would be well not to exaggerate (as phonologists often do) the similarity in state or function of what is considered to be the "same" feature when used with different segments. The same velic coupling will work about as well with a labial closure as an apical one to create [m] and [n] but as the closure gets further back the nasal consonants that result get progressively less consonantal. This is because an 5
⁵ Pharyngeal expansion was not used in the implementation of the [u]-like vowel 3 in figure 7.1, but if it had been it would have approached more closely the corner vowel [u] from the Peterson and Barney study.
important element in the creation of a nasal consonant is the "cul-de-sac" resonating cavity branching off the pharyngeal-nasal resonating cavity. This "cul-de-sac" naturally gets shorter and acts less effectively as a separate cavity the further back the oral closure is (Fujimura 1962; Ohala 1975, 1979a, b; Ohala and Lorentz 1977). I believe this accounts for the lesser incidence or more restricted distribution of [rj] in the sound systems of the languages of the world6. Similarly, although a stop burst is generally a highly salient acoustic event, all stop bursts are not created equal. Velar and apical stop bursts have the advantage of a resonating cavity downstream which serves to reinforce their amplitude; this is missing in the case of labial stop bursts. Accordingly, among stops that rely heavily on bursts, i.e. voiceless stops (pulmonic or glottalic), the labial position is often unused, has a highly restricted distribution, or simply occurs less often in running speech (Wang and Crawford 1960; Gamkrelidze 1975; Maddieson 1984: ch. 2). The more one digs into such matters, the more differences are found in the "same" feature occurring in different segments: as mentioned above, Sawashima and Hirose have found differences in the character of glottal state during fricatives vis-a-vis cognate stops. The conclusion to draw from this is that what matters most in speech communication is making sounds which differ from each other; it is less important that these be made out of recombinations of the same gestures used in other segments. The orderly grid-like systems of oppositions among the sounds of a language which one finds especially in Prague School writings (Trubetzkoy 1939 [1969]) are quite elusive when examined phonetically. Instead, they usually exhibit subtle or occasionally not-so-subtle asymmetries. Whether one can make a case for symmetry phonologically is another matter but phonologists cannot simply assume that the symmetry is self-evident in the phonetic data. 7.2.2.5 Final comment on the preceding evolutionary scenario
I have offered plausibility arguments that some of the properties we commonly associate with the notion "segment," i.e. temporal coordination of articulators, steady-states, and use of a small set of combinable features, are derivable from physical and physiological constraints of the speaking and hearing mechanisms in combination with constraints of the task of forming a vocabulary. High bit-rate transitions separated by "dead" intervals are suggested to be the result of this effort. The traditional notion of the "segment" itself - which is associated with the intervals between the 6
⁶ Given that [ŋ] is much less "consonantal" than other nasal consonants and given its long transitions (which it shares with any velar consonant), it is often more a nasalized velar glide or even a nasalized vowel. I think this is the reason it often shows up as an alternant of, or substitute for, nasalized vowels in coda position, e.g., in Japanese, Spanish, Vietnamese. See Ohala (1975).
transitions - is thought to be derived from the probabilities of cooccurrence of successive transitions. It is important to note that temporal coordination of articulators is a necessary property of the transitions not of the traditional segment. The evolutionary scenario presented above is admittedly speculative. But the arguments regarding the necessity for temporal coordination of articulators in order to build a vocabulary exhibiting sufficient contrast are based on well-known phonetic principles and have already been demonstrated in numerous efforts at articulatory-based synthesis.

7.3 Interpretation
7.3.1 Segment is primitive now If the outcome of this "gedanken" simulation is accepted, then it must also be accepted that spoken languages incorporate, indeed, are based on, the segment. To paraphrase Voltaire on God's existence: if segments did not exist, we would have invented them (and perhaps we did). Though not a primitive in the prespeech stage, it is a primitive now, that is, from the point of view of anyone having to learn to speak and to build a vocabulary. All of the arguments given above for why temporal coordination between articulators was necessary to create an optimal vocabulary would still apply for the maintenance of that vocabulary. I suggest, then, that autosegmental phonology's desegmentalization of speech, especially traditional segmental sounds (as opposed to traditional suprasegmentals) is misguided. Attempts to link up the features or autosegments to the "time slots" in the CV tier by purely formal means (left-to-right association, etc.) are missing an important function of the segment. The linkages or coordination between features are there for an important purpose: to create contrasts, which contrasts exploit the capabilities of the speech apparatus by coordinating gestures at different locations within the vocal tract. Rather than being linked by purely formal means that take no account of the "intrinsic content" of the features, the linkages are determined by physical principles. If features are linked of necessity then they are not autonomous. (See also Kingston [1990]; Ohala [1990b].) It is true that the anatomy and physiology of the vocal tract permit the coordination between articulators to be loose or tight. A relatively loose link is found especially between the laryngeal gestures which control the fundamental frequency (Fo) of voice and the other articulators. Not coincidentally, it was tone and intonation that were the first to be autosegmentalized (by the Greeks, perhaps; see below). Nevertheless, even Fo modulations have to be coordinated with the other speech events. Phonologically, there are cases 177
where tone spreading is blocked by consonants known to perturb Fo in certain ways (Ohala 1982 and references cited there). Phonetically, it has been demonstrated that the Fo contours characteristic of word accent in Swedish are tailored in various ways so that they accommodate to the voiced portions of the syllables they appear with (Erikson and Alstermark 1972; Erikson 1973). Evidence of a related sort for the Fo contours signaling stress in English has been provided by Steele and Liberman (1987). Vowel harmony and nasal prosodies, frequently given autosegmental treatment, also do not show themselves to be completely independent of other articulations occurring in the vocal tract. Vowel harmony shows exceptions which depend on the phonetic character of particular vowels and consonants involved (Zimmer 1969; L. Anderson 1980). Vowel harmony that is presumably still purely phonetic (i.e., which has not yet become phonologized) is observable in various languages (Ohman 1966a; Yaeger 1975) but these vowel-on-vowel effects are (a) modulated by adjacent consonants and (b) are generally highly localized in their domain, being manifested just on that fraction of one vowel which is closest to another conditioning vowel. In this latter respect, vowel-vowel coarticulation shows the same temporal limitation characteristic of most assimilations: it does not spread over unlimited domains. (I discuss below how assimilations can enlarge their temporal domain through sound change, i.e. the phonologization of these short-span phonetic assimilations.) Assimilatory nasalization, another process that can develop into a trans-syllabic operation, is sensitive to whether the segments it passes through are continuant or not and, if noncontinuant, whether they have a constriction further forward of the uvula (Ohala 1983). All of this, I maintain, gives evidence for the temporal linkage of features.7 7.3.2 Possible counterarguments
Let me anticipate some counterarguments to this position. 7.3.2.1 Feature geometry It might be said that some (all?) of the interactions between features will be taken care of by so-called "feature geometry" (Clements 1985; McCarthy 1989) which purports to capture the corelatedness between features through a hierarchical structure of dependency relationships. These dependencies are said to be based on phonetic considerations. I am not optimistic that the interdependencies among features can be adequately represented by any network that posits a simple, asymmetric, transitive, dependency relationship 7
⁷ Though less common, there are also cases where spreading nasalization is blocked by certain continuants, too; see Ohala (1974, 1975).
between features. The problem is that there exist many different types of physical relationships between the various features. Insofar as a phonetic basis has been considered in feature geometry, it is primarily only that of spatial anatomical relations. But there are also aerodynamic and acoustic relations, and feature geometry, as currently proposed, ignores these. These latter domains link anatomically distant structures. Some examples (among many that could be cited): simultaneous oral and velic closures inhibit vocalcord vibration; a lowered soft palate not only inhibits frication and trills in oral obstruents (if articulated at or further forward of the uvula) but also influences the Fj (height) of vowels; the glottal state of high airflow segments (such as /s/, /p/), if assimilated onto adjacent vowels, creates a condition that mimics nasalization and is apparently reinterpreted by listeners as nasalization; labial-velar segments like [w,kp] pattern with plain labials ([ + anterior]) when they influence vowel quality or when frication or noise bursts are involved, but they frequently pattern like velars ([ — anterior]) when nasal consonants assimilate to them; such articulatorily distant and disjoint secondary articulations as labialization, retroflexion, and pharyngealization have similar effects on high vowels (they centralize [i] maximally and have little effect on [u]) (Ohala 1976, 1978, 1983, 1985a, b; Beddor, Krakow, and Goldstein 1986; Wright 1986). I challenge the advocates of "feature geometry" to represent such crisscrossing and occasionally bidirectional dependencies in terms of asymmetric, transitive, relationships. In any case, the attempt to explain these and a host of other dependencies other than by reference to phonetic principles will be subject to the fundamental criticism: even if one can devise a formal relabeling of what does happen in speech, one will not be able to show in principle - that is, without ad hoc stipulations - why certain patterns do not happen. For example, why should [ + nasal] affect primarily the feature [high] in vowels and not the feature [back]? Why should [ — continuant] [ — nasal] inhibit [ +voice] instead of [ — voice]? 73.2.2 Grammar, not physics Also, it might legitimately be objected that the arguments I have offered so far have been from the physical domain, whereas autosegmental representations have been posited for speakers' grammars, i.e. in the psychological domain.8 The autosegmental literature has not expended much effort gathering and evaluating evidence on the psychological status of autoseg8
⁸ There is actually considerable ambiguity in current phonological literature as to whether physical or psychological claims are being made or, indeed, whether the claims should apply to both domains or neither. I have argued in several papers that phonologists currently assign some events that are properly phonetic to the psychological domain (Ohala 1974, 1985b, forthcoming). But even this is not quite so damaging as assigning to the synchronic psychological domain events which properly belong to a language's history.
ments, but there is at least some anecdotal and experimental evidence that can be cited and it is not all absolutely inconsistent with the autosegmental position (though, I would maintain, it does not unambiguously support it either). Systematic investigation of the issues is necessary, though, before any confident conclusions may be drawn. Even outside of linguistics analyzers of song and poetry have for millennia extracted metrical and prosodic structures from songs and poems. An elaborate vocabulary exists to describe these extracted prosodies, e.g. in the Western tradition the Greeks gave us terms and concepts such as iamb, trochee, anapest, etc. Although worth further study, it is not clear what implication this has for the psychological reality of autosegments. Linguistically naive (as well as linguistically sophisticated) speakers are liable to the reification fallacy. Like Plato, they are prone to regard abstract concepts as real entities. Fertility, war, learning, youth, and death are among the many fundamental abstract concepts that people have often hypostatized, sometimes in the form of specific deities. Yet, as we all know, these concepts only manifest themselves when linked with specific concrete people or objects. They cannot "float" as independent entities from one object to another. Though more prosaic than these (so to speak), is iamb any different? Are autosegments any different? But even if we admit that ordinary speakers are able to form concepts of prosodic categories paralleling those in autosegmental phonology, no culture, to my knowledge, has shown an awareness of comparable concepts involving, say, nasal (to consider one feature often treated autosegmentally). That is, there is no vocabulary and no concept comparable to iamb and trochee for the opposite patterns of values for [nasal] in words like dam vs. mid or mountain vs. damp. The concepts and vocabulary that do exist in this domain concerning the manipulation of nonprosodic entities are things like rhyme, alliteration, and assonance, all of which involve the repetition of whole segments. Somewhat more to the point, psychologically, is evidence from speech errors, word games and things like "tip of the tongue" (TOT) recall. Errors of stress placement and intonation contour do occur (Fromkin 1976; Cutler 1980), but they are often somewhat difficult to interpret. Is the error of ambiguty for the target ambiguity a grafting of the stress pattern from the morphologically related word ambiguous (which would mean that stress is an entity separable from the segments it sits on) or has the stem of this latter word itself intruded? Regarding the shifting of other features, including [nasal] and those for places of articulation, there is some controversy. Fromkin (1971) claimed there was evidence of feature interchange, but Shattuck-Hufnagel and Klatt (1979) say this is rare - usually whole bundles of features, i.e. phonemes, are what shift. Hombert (1986) has demonstrated 180
using word games that the tone and vowel length of words can, in some cases (but not all), be stripped off the segments they are normally realized on and materialized in new places. In general, though, word games show that it is almost invariably whole segments that are manipulated, not features. TOT recall (recall of some aspects of the pronunciation of a word without full retrieval of the word) frequently exhibits awareness of the prosodic character of the target word (including the number of syllables, as it happens; Brown and McNeill 1966; Browman 1978). Such evidence is suggestive but unfortunately, even when knowledge of features is demonstrated, it does not provide crucial evidence for differentiating between the psychological reality of traditional (mutually linked) features and autonomous features, i.e., autosegments. There is as yet no hitching post for autosegmental theory in this data. 7.3.2.3 Features migrate across segment boundary lines
If, as I maintain, there is temporal coordination between features in order to create contrasts, how do I account for the fact that features are observed to spill over onto adjacent segments as in assimilation? My answer to this is first to reemphasize that the temporal coordination occurs on the transitions, not necessarily during the traditional segment. Second, "coordination" does not imply that the participating articulators all change state simultaneously at segment boundaries, rather, that at the moment the rapid acoustic modulation, i.e. the transition, is to occur, e.g., onset of a postvocalic [3], the various articulators have to be in specified states. These states can span two or more traditional segments. For a [3] the soft palate must be elevated and the tongue must be elevated and bunched in the palatal region. Given the inertia of these articulators, these actions will necessarily have to start during the preceding vowel (if these postures had not already been attained). Typically, although many of these preparatory or perseveratory gestures leave some trace in the speech signal, listeners learn to discount them except at the moment when they contribute to a powerful acoustic modulation. The speech-perception literature provides many examples showing that listeners factor out such predictable details unless they are presented out of their normal context (i.e. where the conditioning environment has been deleted) (Ohala 1981b; Beddor, Krakow, and Goldstein 1986). My own pronunciation of "measure," with what I regard as the "monophthong" [e] in the first syllable, has a very noticeable palatal glide at the end of that vowel which makes it resemble the diphthong [ej] in a word like "made." I do not perceive that vowel as [ej], presumably because when I "parse" 9 the signal I assign the palatal on-glide to the [3], not the vowel. Other listeners may parse this glide with the vowel and thus one finds dialectally mergers of /ej/ and /e/ 9
⁹ I use "parse" in the sense introduced by Fowler (1986).
before palato-alveolars, e.g., spatial and special become homophones. (See also Kawasaki [1986] regarding the perceptual "invisibility" of nasalization near nasal consonants.) Thus, from a phonetic point of view such spill-over of articulatory gestures is well known (at least since the early physiological records of speech using the kymograph) and it is a constant and universal feature of speech, even before any sound change occurs which catches the attention of the linguist. Many features thus come "prespread," so to speak; they do not start unspread and then migrate to other segments. Such spill-over only affects the phonological interpretation of neighboring elements if a sound change occurs. I have presented evidence that sound change is a misapprehension or reinterpretation on the part of the listener (Ohala 1974, 1975, 1981b, 1985a, 1987,1989). Along with this reinterpretation there may be some exaggeration of aspects of the original pronunciation, e.g. the slight nasalization on a vowel may now be heavy and longer. Under this view of sound change, no person, neither the speaker nor the listener, has implemented a change in the sense of having in their mental grammar a rule that states something like /e/ -> /ej/ /_/3/; rather, the listener parses the signal in a way that differs from the way the speaker parses it. Similarly, if a reader misinterprets a carelessly handwritten "n" as the letter "u," we would not attribute to the writer or the reader the psychological act or intention characterized by the rule "n" -* "u." Such a rule would just be a description of the event from the vantage point of an observer (a linguist?) outside the speaker's and listener's domains. In the case of sound patterns of language, however, we are now able to go beyond such external, "telescoped," descriptions of events and provide realistic, detailed, accounts in terms of the underlying mechanisms. The migration of features is therefore not evidence for autosegmental representations and is not evidence capable of countering the claim that features are nonautonomous. There is no mental representation requiring unlinked or temporally uncoordinated features. 7.4 Conclusions
I have argued that features are so bound together due to physical principles and task constraints that if we started out with uncoordinated features they would have linked themselves of their own accord.10 Claims that features can be unlinked have not been made with any evident awareness of the full phonetic complexity of speech, including not only the anatomical but also the aerodynamic and the acoustic-auditory principles governing it. Thus, more than twenty years after the defect was first pointed out, phonological 10
¹⁰ Similar arguments are made under the heading of "feature enhancement" by Stevens, Keyser, and Kawasaki (1986) and Stevens and Keyser (1989).
representations still fail to reflect the "intrinsic content" of speech (Chomsky and Halle 1968: 400ff.). They also suffer from a failure to consider fully the kind of diachronic scenario which could give rise to apparent "spreading" of features, one of the principal motivations for unlinked features. What has been demonstrated in the autosegmental literature is that it is possible to represent speech-sound behavior using autosegments which eventually become associated with the slots in the CV skeleton. It has not been shown that it is necessary to do so. The same phonological phenomena have been represented adequately (though still not explanatorily) without autosegments. But there must be an infinity of possible ways to represent speech (indeed, we have seen several in the past twenty-five years and will no doubt see several more in the future); equally, it was possible to represent apparent solar and planetary motion with the Ptolemaic epicycles and the assumption of an earth-centered universe. But we do not have to relive history to see that simply being able to "save the appearances" of phenomena is not justification in itself for a theory. However, even more damaging than the lack of a compelling motivation for the use of autosegments, is that the concept of autosegments cannot explain the full range of phonological phenomena which involve interactions between features, a very small sample of which was discussed above. This includes the failure to account for what does not occur in phonological processes or which occurs much less commonly. On the other hand, I think phonological accounts which make reference to the full range of articulatory, acoustic, and auditory factors, supported by experiments, have a good track record in this regard (Ohala 1990a).
Comments on chapter 7
G. N. CLEMENTS
In his paper "The segment: primitive or derived?" Ohala constructs what he calls a "plausibility argument" for the view that there is no level of phonological representation in which features are not coordinated with each other in a strict one-to-one fashion.* In contrast to many phoneticians who have called attention to the high degree of overlap and slippage in speech production, Ohala argues that the optimal condition for speech perception requires an alternating sequence of precisely coordinated rapid transitions and steady-states. From this observation, he concludes that the phonological
* Research for this paper was supported in part by grant no. INT-8807437 from the National Science Foundation.
representations belonging to the mental grammars of speakers must consist of segments defined by sets of coordinated features. Ohala claims to find no phonetic or phonological motivation for theories (such as autosegmental phonology) that allow segments to be decomposed into sets of unordered features. We can agree that there is something right about the view that precise coordination of certain articulatory events is conducive to the optimal transmission of speech. This view is expressed, in autosegmental phonology, in the association conventions whose primary function is to align features in surface representation that are not aligned in underlying representation. In fact, autosegmental phonology goes a step further than this and claims that linear feature alignment is a default condition on underlying representation as well. It has been argued that a phonological system is more highly valued to the extent that its underlying representations consist of uniformly linear sequences in which segments and features are aligned in a linear, one-to-one fashion (Clements and Goldsmith 1984). Departures from this type of regularity come at a cost, in the sense that the learner must identify each type of nonlinearity and enter it as a special statement in the grammar. Far from advocating a thoroughgoing "desegmentalization of speech," the position of autosegmental phonology has been a conservative one in this respect, compared to other nonlinear theories such as Firthian prosodic analysis, which have taken desegmentalization a good deal further.'1 While we can agree that the coordination of features represents a default condition on phonological representation perhaps at all levels, this is not the whole story. The study of phonological systems shows clearly that features do not always align themselves into segments, and it is exactly for this reason that nonlinear representational systems such as autosegmental phonology have been developed. The principle empirical claim of autosegmental phonology is that phonological rules treat certain features and feature sets independently of others, in ways suggesting that features and segments are not always aligned in one-to-one fashion. The assignment of such features to independent tiers represents a direct and nonarbitrary way of expressing this functional independence. Crucial evidence for such feature autonomy comes from the operation of rules which map one structure into another: it is only when phonological features are "in motion," so to speak, that we can determine which features act together as units. Ohala responds to this claim by arguing that the rules which have been proposed in support of such functional feature independence belong to physics or history, not grammar. In his view, natural languages do not have 11
¹ See, for example, Local (this volume). Recently, Archangeli (1988) has argued for full desegmentalization of underlying representations within a version of underspecification theory.
synchronic assimilation rules, though constraints on segment sequences may reflect the reinterpretation (or "misanalysis") of phonetic processes operating at earlier historical periods. If true, this argument seriously undermines the theory at its basis. But is it true? Do we have any criteria for determining when a detectable linguistic generalization is a synchronic rule? This issue has been raised elsewhere by Ohala himself. He has frequently expressed the view that a grammar which aims at proposing a model of speaker competence must distinguish between regularities which the speaker is "aware" of, in the sense that they are used productively, and those that are present only for historical reasons, and which do not form part of the speaker's grammar (see e.g. Ohala 1974; Ohala and Jaeger 1986). In this view, the mere existence of a detectable regularity is not by itself evidence that it is incorporated into the mental grammar as a synchronic rule. If a regularity has the status of a rule, we expect it to meet what we might call the productivity standard: the rule should apply to new forms which the speaker has not previously encountered, or which cannot have been plausibly memorized. Since its beginnings, autosegmental phonology has been based on the study of productive rules in just this sense, and has based its major theoretical findings on such rules. Thus, for example, in some of the earliest work, Leben (1973) showed that when Bambara words combine into larger phrases, their surface tone melody varies in regular ways. This result has been confirmed and extended in more recent work showing that the surface tone pattern of any word depends on tonal and grammatical information contributed by the sentence as a whole (Rialland and Badjime 1989). Unless all such phrases are memorized, we must assume that unlinked "tone melodies" constitute an autonomous functional unit in the phonological composition of Bambara words. Many other studies in autosegmental phonology are based on productive rules of this sort. The rules involved in the Igbo and Kikuyu tone systems, for example, apply across word boundaries and, in the case of Kikuyu, can affect multiple sequences of words at once. Unless we are willing to believe that entire sentences are listed in the lexicon, we are forced to conclude that the rules are productive, and part of the synchronic grammar.² Such evidence is not restricted to tonal phenomena. In Icelandic, preaspirated stops are created by the deletion of the supralaryngeal features of the first member of a geminate unaspirated stop and of the laryngeal features of the second. In his study of this phenomenon, Thrainsson (1978) shows at considerable length that it satisfies a variety of productivity criteria, and must be part of a synchronic grammar. In Luganda, the rules whose operation brings to light
² For Igbo see Goldsmith (1976), Clark (1990); for Kikuyu see Clements (1984) and references therein.
the striking phenomenon of "mora stability" apply not only within morphologically complex words but also across word boundaries. The independence of the CV skeleton in this language is confirmed not only by regular alternations, but also by the children's play language called Ludikya, in which the segmental content of syllables is reversed while length and tone remain constant (Clements 1986). In English, the rule of intrusive stop formation which inserts a brief [t] in words like prince provides evidence for treating the features characterizing oral occlusion as an autosegmental node in hierarchical feature representation (see Clements 1987 for discussion); the productivity of this rule has never been questioned, and is experimentally demonstrated in Ohala (1981a). In sum, the argumentation upon which autosegmental phonology is based has regularly met the productivity standard as Ohala and others have characterized it. We are a long way from the days when to show that a regularity represented a synchronic rule, it was considered sufficient just to note that it existed. But even if we agree that autosegmental rules constitute a synchronic reality, a fundamental question still remains: if, as Ohala argues, linear coordination of features represents the optimal condition for speech perception, why do we find feature asynchrony at all? The reasons for this lie, at least in part, in the fact that phonological structure involves not only perceptually motivated constraints, but also articulatorily motivated constraints, as well as higher-order grammatical considerations. Phonology (in the large sense, including much of what is traditionally viewed as phonetics) is concerned with the mapping between abstract lexicosyntactic representations and their primary medium of expression, articulated speech. At one end of this mapping we find linguistic structures whose formal organization is hierarchical rather than linear, and at the other end we find complex interactions of articulators involving various degrees of neural and muscular synergy and inertia. Neither type of structure lends itself readily or insightfully to expression in terms of linear sequences of primitives (segments) or stacks of primitives (feature bundles). In many cases, we find that feature asynchrony is regularly characteristic of phonological systems in which features and feature sets larger and smaller than the segment have a grammatical or morphological function. For instance, in languages where tone typically serves a grammatical function (as in Bantu languages, or many West African languages), we find a greater mismatch between underlying and surface representations than in languages where its function is largely lexical (as in Chinese). In Bambara, to take an example cited earlier, the floating low tone represents the definite article and the floating high tone is a phrasal-boundary marker; while in Igbo, the floating tone is the associative-construction marker. The well-known nonlinearities found in Semitic verb morphology are due to the fact that 186
consonants, vowels, and templates all play a separate grammatical role in the make-up of the word (McCarthy 1981). In many further cases, autosegmentalized features have the status of morpheme-level features, rather than segment-level features (to use terminology first suggested by Robert Vago). Thus in many vowel-harmony systems, the harmonic feature (palatality, ATR (Advanced Tongue Root), etc.) commutes at the level of the root or morpheme, not the segment. In Japanese, as Pierrehumbert and Beckman point out (1988), some tones characterize the morpheme while others characterize the phrase, a fact which these authors represent by linking each tone to the node it characterizes. What these and other examples suggest is that nonlinearities tend to arise in a system to the extent that certain subsets of features have a morphological or syntactic role independent of others. Other types of asynchronies between features appear to have articulatory motivation, reflecting the relative sluggishness of some articulators with respect to others (cf. intrusive stop formation, nasal harmonies, etc.), while others may have functional or perceptual motivation (the Icelandic preaspiration rule preserves the distinction between underlying aspirated and unaspirated geminate stops from the effects of a potentially neutralizing deaspiration rule, but it translates this distinction into one between preaspirated and unaspirated geminates). If all such asynchronies represent departures from the optimal or "default" state and in this way add to the formal complexity of a phonological representation, then many of the rules and principles of autosegmental phonology can be viewed as motivated by the general, overriding principle: reduce complexity. Ohala argues that "feature geometry," a model which uses evidence from phonological rules to specify a hierarchical organization among features (Clements 1985; Sagey 1986a; McCarthy 1988), does not capture all observable phonetic dependencies among features, and is therefore incomplete. However, feature geometry captures a number of significant cross-linguistic generalizations that could not be captured in less structured feature systems, such as the fact that the features defining place of articulation commonly function as a unit in assimilation rules. True, it does not and cannot express certain further dependencies, such as the fact that labiality combines less optimally with stop production (as in [p]) than do apical or velar closure (as in [t] or [k]). But not all such generalizations form part of phonological grammars. Thus, phonologists have not discovered any tendency for rules to refer to the set of all stops except [p] as a natural class. On the contrary, in spite of its less efficient exploitation of vocal-tract mechanics, [p] consistently patterns with [t] and [k] in rules referring to oral stops, reflecting the general tendency of phonological systems to impose a symmetrical classification upon speech sounds sharing linguistically significant properties. If feature geometry attempted to derive all phonetic as well as phonological dependen187
cies from its formalism, it would fail to make correct predictions about crosslinguistically favored rule types, in this and many other cases. Ohala rejects autosegmental phonology (and indeed all formal approaches to phonology) on the grounds that its formal principles may have an ultimate explanation in physics and psychoacoustics, and should therefore be superfluous. But this argument overlooks the fact that physics and psychology are extremely complex sciences, which are subject to multiple (and often conflicting) interpretations of phenomena in almost every area. Physics and psychology in their present state can shed light on some aspects of phonological systems, but, taken together, they are far from being able to offer the hard, falsifiable predictions that Ohala's reductionist program requires if it is to acquire the status of a predictive empirical theory. In particular, it is often difficult to determine on a priori grounds whether articulatory, aerodynamic, acoustic, or perceptual considerations play the predominant role in any given case, and these different perspectives often lead to conflicting expectations. It is just here that the necessity for formal models becomes apparent. The advantage of using formal models in linguistics (and other sciences) is that they can help us to formulate and test hypotheses within the domain of study even when we do not yet know what their ultimate explanation might be. If we abandoned our models on the grounds that we cannot yet explain and interpret them in terms of higher-level principles, as Ohala's program requires, we would make many types of discovery impossible. To take one familiar example: Newton could not explain the Law of Gravity to the satisfaction of his contemporaries, but he could give a mathematical statement of it - and this statement proved to have considerable explanatory power. It is quite possible that, ultimately, all linguistic (and other cognitive) phenomena will be shown to be grounded in physical, biological, and psychological principles in the largest sense, and that what is specific to the language faculty may itself, as some have argued, have an evolutionary explanation. But this possibility does not commit us to a reductionist philosophy of linguistics. Indeed, it is only by constructing explicit, predictive formal or mathematical models that we can identify generalizations whose relations to language-external phenomena (if they exist) may one day become clear. This is the procedure of modern science, which has been described as follows by one theoretical physicist (Hawking 1988: 10): "A theory is a good theory if it satisfies two requirements: It must ultimately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations." This view is just as applicable to linguistics and phonetics as it is to physics. Ohala's paper contains many challenging and useful ideas, but it over188
states its case by a considerable margin. We can agree that temporal coordination plays an important role in speech production and perception without concluding that phonological representations are uniformly segmental. On the contrary, the evidence from a wide and typologically diverse number of languages involving both "suprasegmental" and traditionally segmental features demonstrates massively and convincingly that phonological systems tolerate asynchronic relations among features at all levels of representation. This fact of phonological structure has both linguistic and phonetic motivation. Although phonetics and phonology follow partly different methodologies and may (as in this case) generate different hypotheses about the nature of phonological structure, the results of each approach help to illuminate the other, and take us further towards our goal of providing a complete theory of the relationship between discrete linguistic structure and the biophysical continuum which serves as its medium.
8 Modeling assimilation in nonsegmental, rule-free synthesis

JOHN LOCAL
8.1 Introduction
Only relatively recently have phonologists begun the task of seriously testing and evaluating their claims in a rigorous fashion.¹ In this paper, I attempt to sustain this task by discussing a computationally explicit version of one kind of structured phonology, based on the Firthian prosodic approach to phonological interpretation, which is implemented as an intelligent knowledge-based "front-end" to a laboratory formant speech synthesizer (Klatt 1980). My purpose here is to report on how the superiority of a structured monostratal approach to phonology over catenative segmental approaches can be demonstrated in practice. The approach discussed here compels new standards of formal explicitness in the phonological domain as well as a need to pay serious attention to parametric and temporal detail in the phonetic domain (see Browman and Goldstein 1985, 1986). The paper falls into two parts: the first outlines the nonsegmental approach to phonological interpretation and representation; the second gives an overview of the way "process" phenomena are treated within this rule-free approach and presents an analysis of some assimilatory phenomena in English and the implementation of that analysis within the synthesis model. Although the treatment of assimilation presented here is similar to some current proposals (e.g. Lodge 1984), it is, to the best of my knowledge, unique in having been given an explicit computational implementation. The approach to phonological analysis presented here is discussed at length in Kelly and Local (1989), where a wide range and variety of languages are considered.

¹ The synthesis of English which forms part of our work in phonological theory is supported by a grant from British Telecom PLC. The work is collaborative and is being carried out by John Coleman and myself. Though my name appears as the author of this paper I owe a great debt to John Coleman, without whom this work could not have taken the shape it does.
8.2 Problems with segmental, rewrite-rule phonologies
In order to give some sense to the terms "nonsegmental" and "rule-free" which appear in the title, it is necessary to sketch the broad outlines of our approach to phonological interpretation and representation. As a prelude to this, it is useful to begin by considering some of the problems inherent in rulebased, segmental-type phonologies. The most explicit version of such a segmental phonology, transformational generative phonology (TGP), is a sophisticated device for deriving strings of surface phonetic segments from nonredundant lexical strings of phonemes. TGP rewrite rules simply map well-ordered "phonological" strings onto well-ordered "phonetic strings." For some time TGP has been the dominant phonological framework for textto-speech synthesis research. There have been a number of general computational, practical, and empirical arguments directed against weaknesses in the TGP approach (see, for example, Botha 1971; Ladefoged 1971; Johnson 1972; Koutsoudas, Sanders, and Noll 1974; Pullum 1978; Linnell 1979; see also Peters and Ritchie 1972; Peters 1973; Lapointe 1977; King 1983, for more general critiques of the computational aspects of TG). For instance, because of the richness and complexity of the class of rules which may be admitted by a TGP model, it may be impossible in practice to derive an optimal TGP. This problem is exacerbated by rule interaction and rule-ordering paradoxes. Moreover, standard TGP models do not explicitly recognize structural domains such as foot, syllable, and onset. This means that structuredependent information cannot be felicitously represented and some additional mechanism(s) must be sought to handle "allophony," "coarticulation," and the like; notice that the frequent appeals to "boundary" features or to "stress," for instance, tacitly trade on structural dependence. The deletion and insertion rules exploited in TGP permit arbitrarily long portions of strings to be removed or added. This is highly problematic. Leaving aside the computational issues (that by admitting deletion rules TGP may define nonrecursive languages) the empirical basis for deletion processes has never been demonstrated. Indeed, the whole excessively procedural, "process"oriented approach (of which "deletion" and "insertion" are but two exemplars) embodied by TGP has never been seriously warranted, defended, or motivated. Numerous criticisms have also been leveled against the TGP model in recent years by the proponents of phonological theories such as autosegmental phonology, dependency phonology, and metrical phonology. These criticisms have been largely directed at the "linearity" and "segmentality" of TGP and the range and complexity of the rules allowed in TGP. An attempt has been made to reduce the number and baroqueness of "rules" required, 191
and much simpler, more general operations and constraints have been proposed. However, despite the claims of their proponents, these approaches embody certain of the entrenched problematic facets of TGP. Transformation rewrite rules (especially deletion and insertion rules) are still employed whenever it is convenient (e.g. Anderson 1974; Clements 1976; McCarthy 1981; Goldsmith 1984). Such rules are only required because strings continue to be central to phonological representation and because phonological and phonetic representations are cast in the same terms and treated as if they make reference to the same kinds of categories. It is true that the use of more sophisticated representations in these approaches has gone some way towards remediating the lack of appropriate structural domains in TGP, but "nonlinear" analyses are still, in essence, segment-oriented and typically treat long-domain phonological units as being merely extended or spread short-domain units. The reasons for treating long-domain units in this way have never been explicated or justified. Nor do any of these "nonlinear" approaches seriously question the long-standing hypothesis that strings of concatenated segments are appropriate phonological representations. Dependency graphs and metrical trees are merely graphical ways of representing structured strings, and autosegmental graphs simply consist of well-ordered, parallel strings linked by synchronizing "association lines." Rather than engaging in a formal investigation of phonological interpretation, these three frameworks have slipped into the trap of investigating the properties of diagrams on paper (see Coleman and Local 1991). The problems I have identified and sketched here have important consequences for both phonology in general and for synthesis-by-rule systems in particular. It is possible, however, to avoid all these problems by identifying and removing their causes; in doing this we aim to determine a more restrictive phonological model. 8.3 Nonsegmental, declarative phonology
Our attempt to construct a more restrictive theory of phonology than those currently available has two main sources of inspiration: Firthian prosodic phonology (Firth 1957) and the work on unification-grammar (UG) formalism (Shieber 1986). The main characteristics of our approach to phonology are that it is abstract; it is monostratal; it is structured; it is monotonic.
8.3.1 Abstractness: phonology and phonetics demarcation

One of the central aspects of the Firthian approach to phonology,² and one that still distinguishes it from much current work, is the insistence on a strict distinction between phonetics and phonology (see also Pierrehumbert and Beckman 1988). This is a central commitment in our work. We take seriously Trubetzkoy's dictum that:

The data for the study of the articulatory as well as the acoustic aspects of speech sounds can only be gathered from concrete speech events. In contrast, the linguistic values of sounds to be examined by phonology are abstract in nature. They are above all relations, oppositions, etc., quite intangible things, which can be neither perceived nor studied with the aid of the sense of hearing or touch. (Trubetzkoy 1939 [1969]: 13)

Like the Firthians and Trubetzkoy, we take phonology to be relational: it is a study whose descriptions are constructed in terms of structural and systemic contrast; in terms of distribution, alternation, opposition, and composition. Our formal approach, then, treats phonology as abstract; this has a number of important consequences. For example, this enables us (like the Firthians) to employ a restrictive, monostratal phonological representation. There is only one level of phonological representation and one level of phonetic representation; there are no derivational steps. This means that for us it would be incoherent to say such things as "a process that converts a high tone into a rising tone following a low tone" (Kaye 1988: 1) or "a striking feature of many Canadian dialects of English is the implementation of the diphthongs [ay] and [aw] as [ʌy] and [ʌw]" (Bromberger and Halle 1989: 58); formulations such as these simply confuse phonological categories with their phonetic exponents. Because phonological descriptions and representations encode relational information they are abstract, algebraic objects appropriately formulated in the domain of set theory. In contrast, phonetic representations are descriptions of physical, temporal events formulated in a physical domain. This being the case, it makes no sense to talk of structure or systems in phonetics: there may be differences between portions of utterance, but in the phonetics there can be no "distinctions." The relationship between phonology and phonetics is arbitrary (in the sense of Saussure) but systematic; I know of no evidence that suggests otherwise (see also Lindau and Ladefoged 1986). The precise form
² One important aspect of our approach which I will not discuss here (but see Kelly and Local 1989) is the way we conduct phonological interpretation. (The consequences of the kinds of phonological interpretation we do can, to some extent, be discerned via our chosen mode of representation.) Nor does space permit a proper discussion of how a declarative approach to phonology deals with morphophonological alternations. However, declarative treatment of such alternations poses no particular problems and requires no additional formal machinery. Firthian prosodic phonology provides a nonprocess model of such alternations (see, e.g., Sprigg 1963).
of phonetic representations therefore has no bearing on the form of the phonological representations. Phonetics is, then, to be seen as interpretive: phonetic representations are the denotations of phonological representations. Thus, like the Firthians, we talk of phonetic exponents of the phonological representations, and like them we do not countenance rules that manipulate these phonetic exponents. This phonetic interpretation of the phonological representations is compositional and couched parametrically. By "compositional interpretation" of phonological representations (expressions) I mean simply that the phonetic interpretation (in terms of values and times) of an expression is a function of the interpretation of the component parts of that expression (Dowty, Wall, and Peters 1981). By "parametric" I mean to indicate that it is essential not to restrict phonetic interpretation to the segmented domains and categories superficially imposed by, say, the shapes of an impressionistic record on the language material. This principle leads to the formulation of phonetic representations in terms of component parameters and their synchronization in time. Within this approach, parametric phonetic interpretation (exponency) is a relation between phonological features at temporally interpreted nodes in the syllable tree and sets of parameter sections (in our implementation Klatt parameters). Exponency is thus a function from phonological features and their structural contexts to parameter values. "Structural contexts" is here taken to mean the set of features which accompany the feature under consideration at its place in structure along with that place in structure in the graph. Phonetic interpretation of the syllable graph is performed in a head-first fashion (constituent heads are interpreted before their dependents; see Coleman 1989), and parameter sections are simply sequences of ordered pairs, where each pair denotes the value of a particular parameter at a particular (linguistically relevant) time. Thus: {node(Category, Tstart, Tend), parameter_section} (see Ladefoged [1980: 495] for a similar formulation of the mapping from phonological categories to phonetic parameters). The times may be absolute or relative. So, for example, given a phonological analysis of the Firthian prosodic kind, which has abstracted at least two independent V-systems (e.g. "short" and "long"; see Albrow 1975) we can establish that only three "height" values need to be systematically distinguished. Given, in addition, some analysis in the acoustic domain such as that presented by Kewley-Port (1982) or Klatt (1980: 986) we can begin to provide a phonetic interpretation for the [height1] feature for syllables such as pit and put thus: {syllable([height1], T1, T2), (F1: 425; B1: 65), (F1: 485; B1: 65)} (Interpolation between the values can be modeled by a damped sinusoid [see Browman and Goldstein 1985].)
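As an illustration of exponency as a function from a feature-in-context to a parameter section, the following Python sketch tabulates the [height1] values just given for pit/put together with the see/sue values cited in the next paragraph. The table layout, function name, and context labels are invented for exposition; they are not the chapter's actual implementation (which is a knowledge-based front end to the Klatt synthesizer).

# Illustrative sketch only: exponency maps a phonological feature *at a place
# in structure* to a parameter section, i.e. an ordered sequence of
# (time, parameter-value) pairs. The F1/B1 figures are those quoted in the
# text; everything else (names, context labels) is invented for this example.

EXPONENCY = {
    ("height1", "short-V syllable (pit, put)"): [
        ("T1", {"F1": 425, "B1": 65}),
        ("T2", {"F1": 485, "B1": 65}),
    ],
    ("height1", "long-V syllable (see, sue)"): [
        ("T1", {"F1": 330, "B1": 55}),
        ("T2", {"F1": 330, "B1": 55}),
        ("T3", {"F1": 350, "B1": 60}),
        ("T4", {"F1": 350, "B1": 60}),
    ],
}

def exponents(feature, context):
    """Return the parameter section for a feature in a given structural context."""
    return EXPONENCY[(feature, context)]

if __name__ == "__main__":
    # The same feature receives different parameter values in different contexts.
    for context in ("short-V syllable (pit, put)", "long-V syllable (see, sue)"):
        print(context, "->", exponents("height1", context))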
As indicated above, however, any given phonological feature is to be interpreted in the context of the set of features which accompany the particular feature under consideration along with its place in structure. This means that a feature such as [height1] will not always receive the same phonetic interpretation. Employing again the data from Klatt (1980) we can provide a phonetic interpretation for the [height1] feature for syllables such as see and sue thus: {syllable([height1], T1, T2, T3, T4), (F1: 330; B1: 55), (F1: 330; B1: 55), (F1: 350; B1: 60), (F1: 350; B1: 60)} Ladefoged (1977) discusses such a structure-dependent interpretation of phonological features. He writes: "the matrix [of synthesizer values (and presumably of some foregoing acoustic analysis)] has different values for F2 (which correlates with place of articulation) for [s] and [t], although these segments are both [+ alveolar]. The feature Alveolar has to be given a different interpretation for fricatives and plosives" (1977: 231). We should note, however, that the domain of these phonetic parameters - whether they are formulated in articulatory or acoustic terms, say, has to do with the task at hand - is implementation-specific; it has not, and cannot have, any implications whatsoever for the phonological theory. The position I have just outlined is not universally accepted. Many linguists view phonology and phonetics as forming a continuum with phonological descriptions presented in the same terms as phonetic descriptions; these are often formulated in terms of a supposedly universal set of distinctive phonetic properties. This can be seen in the way phonological features are employed in most contemporary approaches. Phonological categories are typically constructed from the names of phonetic (or quasiphonetic) categories, such as "close, back, vowel," or "voiced, noncoronal, obstruent." By doing this the impression of a phonology-phonetics continuum is maintained. "Phonetic" representations in generative phonologies are merely the end result of a process of mappings from strings to strings; the phonological representations are constructed from features taking binary values, the phonetic representations employing the same features usually taking scalar values. Chomsky and Halle explicitly assert this phonetics-phonology continuum when they write: "We take 'distinctive features' to be the minimal elements of which phonetic, lexical, and phonological transcriptions are composed, by combination and concatenation" (1968: 64). One reason that this kind of position is possible at all, of course, is that in such approaches there is nothing like an explicit phonetic representation. Typically, all that is provided are a few feature names, or segment symbols; rarely is there any indication of what would be involved in constructing the algorithm that would allow us to test such claims. The impression of a phonology-phonetics continuum is, of course, quite illusory.
In large part, the illusion is sustained by the entirely erroneous belief that the phonological categories have some kind of implicit, naive phonetic denotation (this seems to be part of what underlies the search for invariant phonetic correlates of phonological categories and the obsession with feature names [Keating 1988b]). Despite some explicit claims to this effect (see, for example, Chomsky and Halle 1968; Kaye, Lowenstamm, and Vergnaud 1985; Bromberger and Halle 1989), phonological features do not have implicit denotations and it is irresponsible of phonologists to behave as if they had when they present uninterpreted notations in their work. One of the advantages that accrues from having a parametric phonetic interpretation distinguished from phonological representation is that the arbitrary separation of "segmental" from "supra-" or "non-"segmental features can be dispensed with. After all, the exponents of so-called "segmental" phonetic aspects are no different from those of "nonsegmental" ones - they are all parameters having various extents. Indeed, any coherent account of the exponents of "nonsegmental" components will find it necessary to refer to "segmental" features: for instance, as pointed out by Adrian Simpson (pers. comm.), lip-rounding and glottality are amongst the phonetic exponents of accentuation (stress). Compare to him, and for him, for example, in versions where to and for are either stressed or nonstressed. Local (1990) also shows that the particular quality differences found in final open-syllable vocoids in words like city and seedy are exponents of the metrical structure of the words (see also the experimental findings of Beckman 1986).

8.3.2 Structured representation

Within our approach, phonological representations are structured and are treated as labeled graph-objects, not strings of symbols. These graphs are unordered. It makes no sense to talk about linear ordering of phonological elements - such a formulation can only make sense in terms of temporal phonetic interpretation. (Compare Carnochan: It is perhaps appropriate to emphasise here that order and place in structure do not correlate with sequence in time ... The symbols with which a phonological structure is written appear on the printed page in a sequence; in the ... structure of jefaffe, the symbol h precedes the symbol B, but one must guard against the assumption that the exponent of the element of structure h precedes the exponent of the element of structure B, in the pronunciation of jefaffe. There is no time in structure, there is no sequence in structure; time and sequence are with reference to the utterance, order and place are with reference to structure [1952: 158].) Indeed, it is largely because phonologists have misinterpreted phonological representations as well ordered, with an order that supposedly maps
straightforwardly to the phonetics, that "processes" have been postulated. Syntagmatic structure, which shows how smaller representations are built up into larger ones, is represented by graphs. The graphs are familiar from the syllable-tree notation ubiquitous in phonology (see Fudge 1987 for copious references). The graphs we employ, however, are directed acyclical graphs (DAGs) rather than trees, since we admit the possibility of multidominated nodes ("reentrant" structures; see Shieber 1986), for instance, in the representation of ambisyllabicity and larger structures built with feature sharing, e.g. coda-onset "assimilation":
(1) [syllable graph with onset, rime, and coda nodes; in such graphs a single node may be shared ("reentrant"), for example between the coda of one syllable and the onset of the next]
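The notion of a multidominated ("reentrant") node can be made concrete with a small Python sketch; the class, node labels, and the shared feature below are invented for illustration and are not part of the chapter's implementation.

# Sketch: a directed graph in which one node is dominated by two parents,
# as in coda-onset feature sharing. Labels and the shared feature are invented.

class Node:
    def __init__(self, label, children=(), features=None):
        self.label = label
        self.children = list(children)
        self.features = features or {}

shared = Node("shared place", features={"grv": "+"})   # a single token of structure

first_syllable = Node("syllable",
                      [Node("onset"),
                       Node("rime", [Node("nucleus"), Node("coda", [shared])])])
second_syllable = Node("syllable",
                       [Node("onset", [shared]),
                        Node("rime", [Node("nucleus")])])

# The coda of the first syllable and the onset of the second dominate the very
# same node, so no copying or "spreading" rule is needed to keep them identical.
assert (first_syllable.children[1].children[1].children[0]
        is second_syllable.children[0].children[0])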
As will become apparent later, our interpretation of the phonetic exponents of such constituent relationships is rather different from that currently found in the phonological literature. Such phonological structures are built by means of (phonotactic) phrase-structure rules.

8.3.3 Feature structures
Paradigmatic structure, which shows how informational differences between representations are encoded, is represented using feature structures (in fact these, too, are graph structures). Feature structures are partial functions from their features to values. Within a category each feature may take only one value. The value of a feature may be atomic (e.g. [height: close], [cmp: — ]) or may itself be structured. This means that the value of some feature may itself be specified by another feature structure rather than by an atomic value. For instance (2a) might be one component of a larger structure associated with an onset node, and (2b) might be a component of a structure associated with a nucleus node: (2)
a. [cons: [grv:   cmp: +]]
b. [voc: [height: close   grv: +]]
(Hierarchical phonological feature structures have also been proposed by Lass [1984b], Clements [1985], and Sagey [1986a], though their interpretation and use is different from that proposed here.) The primary motivation for adopting a graph-theoretic view of phonological representations is that this enables us to formulate our proposals within a mathematically explicit and well-understood formalism. By assigning an independent node to every phonological unit of whatever extent, we do away with the need to recognize "segments" at any level of representation as well as with the need to postulate such things as "spreading" rules. That is, we employ a purely declarative mode of representation. (Although the term "declarative" was not in current usage during the period when Firthian prosodic analysis was in development, it is clear that the nonprocess orientation of prosodic phonology was declarative in spirit. For further discussion of this claim see the reinterpretation by Broe [1988] of Allen's [prosodic] treatment of aspiration in Harauti within a declarative unification framework.) It is particularly important to note that within this approach no primacy is given to any particular size of node domain. If we consider the phonological specification of the contrasts in the words pit ~ put, bat ~ bad, and bent ~ bend, we can get a feel for some of the implications of structured representations. In a typical segmental (or, for that matter, "nonlinear") account words such as pit and put are likely to have their vowels specified as [+ rnd] (either by explicit feature specification or as a result of some default or underspecification mechanism) which is then copied, or spread, to adjacent consonants in order to reflect the facts of coarticulation of lip rounding. In an approach which employs structured representations, however, all that is necessary is to specify the domain of [+ rnd] as being the whole syllable. Thus: (3)
            syllable
            [± rnd]
          /    |    \
      onset    X    coda
Onset and coda are in the domain of [± rnd] (though coda does not bear the feature distinctively) by virtue of their occurrence as syllable constituents. Once the structural domain of oppositions is established there is no need to
employ process modeling such as "copying" or "spreading." In a similar fashion nonstructured phonologies will typically specify the value of the contrast feature [± voice] for consonants, whereas vowels will typically be specified (explicitly, or default-specified) as being [+ voi]. A structured phonological representation, on the other hand, will explicitly recognize that the opposition of voicing holds over onsets and rimes, not over consonants and vowels (see Browman and Goldstein 1986: 227; Sprigg 1972). Thus vowels will be left unspecified for voicing and the phonological voicing distinction between bat ~ bad and bent ~ bend, rather than being assigned to a coda domain, will be assigned to a rime domain: (4)
            syllable
           /        \
       onset        rime
                   [+ voi]
                  /      \
             nucleus     coda
If this is done, then the differences in voice quality and duration of the vowel (and, in the case of bent ~ bend, the differences in the quality and duration of nasality), the differences in the nature of the transitions into the closure and the release of that closure can be treated in a coherent and unified fashion as the exponents of rime-domain voicing opposition. (The similarity between these kinds of claims and representations and the sorts of representations found in prosodic analysis should be obvious.3) The illustrative representations I have given have involved the use of rather conventional phonological features. The features names that we use in the construction of phonological representations are, in the main, taken from the Jakobson, Fant, and Halle (1952) set, though nothing of particular interest hangs on this. However, they do differ from the "distinctive features" described by Jakobson, Fant, and Halle in that, as I indicated earlier, they are purely phonological; they have no implicit phonetic interpretation. They differ from the Jakobson, Fant, and Halle features in two other respects. First, when they are interpreted (because of the compositionality principle) they do not have the same interpretation wherever they occur in structure. 3
For example, it is not uncommon (e.g. Albrow 1975) to find "formulae" like yw(CViCh/fi) as partial representations of the systemic (syntagmatic and paradigmatic) contrasts instantiated by words such as pit and put.
So, for instance, the feature [ + voi] at onset will not receive the same interpretation as [+ voi] at rime. Second, by employing hierarchical, category-valued features, such as [cons] and [voc] as in (5) below, it enables the same feature names to be used within the same feature structure but with different values. Here, in the partial feature structure relating to apicality with open approximation and velarity at onset, [grv] with different values is employed to encode appropriate aspects of the primary and secondary articulation of the consonantal portion beginning a word such as red. (5)
onset:
    cons: [grv:   cmp:]
    voc:  [grv:   height:   rnd:]
    src:  [nas:]
+ 0 + −
Once the representations in the phonology are treated as structured and strictly demarcated from parametric phonetic representations, then notions such as "rewrite rule" are redundant. Nor do we need to have recourse to operations such as deletion and insertion. We can treat such "processes" as different kinds of parameter synchronization in the phonetic interpretation of the phonological representation. This position is, of course, not new or unique to the approach I am sketching here (see e.g. Browman and Goldstein 1986). Even within more mainstream generative phonologies similar proposals have been made (Anderson 1974; Mohanan 1986), though not implemented in the thoroughgoing way I suggest here. As far as I can see, however, none of these accounts construe "phonology" in the sense employed here and all trade in well-ordered concatenative relations. One result of adopting this position is that in our approach phonological combinators are formulated as nondestructive - phonological operations are monotonic. The combinatorial operations with which phonological structures are constructed can only add information to representations and cannot remove anything - "information about the properties of an expression once gained is never lost" (Klein 1987:5). As I will show in the remainder of the paper, this nondestructive orientation has implications for the underspecification of feature values in phonological representations.
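The monotonic, information-adding character of such combination can be illustrated with a short unification sketch in Python; the representation of feature structures as nested dictionaries and the example structures are assumptions made for exposition only, not the chapter's own formalism.

# Sketch: feature structures as partial functions (nested dicts) combined by
# unification. Unification can add compatible information but never remove or
# overwrite it - it fails instead of destroying a value already present.

class UnificationFailure(Exception):
    pass

def unify(a, b):
    """Return the smallest feature structure containing all information in a and b."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for feature, value in b.items():
            merged[feature] = unify(merged[feature], value) if feature in merged else value
        return merged
    if a == b:                        # identical atomic values are compatible
        return a
    raise UnificationFailure(f"{a!r} is incompatible with {b!r}")

onset_info = {"cons": {"cmp": "+"}}          # partial, (2a)-like information
syllable_rounding = {"rnd": "+"}             # a feature holding over the syllable domain

print(unify(onset_info, syllable_rounding))  # information is pooled, nothing is lost
# unify({"rnd": "+"}, {"rnd": "-"}) would raise UnificationFailure:
# a value, once gained, cannot be changed.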
8.4 Temporal interpretation: dealing with "processes"
Consider the following syllable graph: (6)
            syllable
           /        \
       onset        rime
                   /    \
             nucleus    coda
A simple temporal interpretation of this might be schematically represented thus:
(7)
    |                      Syllable exponents                       |
    |  Onset exponents  |              Rime exponents               |
                        |  Nucleus exponents  |   Coda exponents    |
             C                     V                    C
Although this temporal interpretation reflects some of the hierarchical aspects of the graph structure (the exponents of "smaller" constituents overlap those of "larger" ones), the interpretation is still essentially concatenative. It treats phonetic exponency as if it consisted of nothing more than well-ordered concatenated sequences of "phonetic objects" of some kind (as in SPE-type approaches or as suggested by the X-slot type of analysis found in autosegmental work). As is well known, however, from extensive instrumental studies, a concatenative-grouping view of phonetic realization simply does not accord with observation - no matter how much we try to fiddle our interpretation of "segment." Clearly, we need a more refined view of the temporal interpretation of exponency. Within the experimental/instrumental phonetics literature just such a view can be found, though as yet it has not found widespread acceptance in the phonological domain. Work by Fowler (1977), Gay (1977), Ohman (1966a), and Perkell (1969), although all conducted with a phonemic orientation, proposes in various ways a "coproduction" (Fowler) or an "overlaying" view of speech organization. When we give the graph above just such an overlaying/cocatenative rather than concatenative interpretation we can begin to see how phenomena such as
coarticulation, deletion, and insertion can be given a nonprocess, declarative representation:
(8)
    |                      Syllable exponents                       |
    |                       Rime exponents                          |
    |                      Nucleus exponents                        |
    |  Onset exponents  |                      |   Coda exponents   |
             C                     V                    C

(The "box notation" is employed in a purely illustrative fashion and has no formal status. So, for instance, the vertical lines indicating the ends of exponents of constituents should not be taken to indicate an absolute cross-parametric temporal synchrony.)

8.4.1 Cocatenation, "coarticulation," and "deletion"
With this model of temporal interpretation of the exponents of phonological constituents we can now see very simple possibilities for the reconstrual of process phenomena. For example, so-called initial consonant-vowel coarticulation can be viewed as an "overlaying" of the exponents of phonological constituents rather than as the usual segment concatenation with a copying or spreading rule to ensure that the consonant "takes on" the appropriate characteristics of the vowel. Employing the box notation used above, this can be shown for the English words keep, cart, and coot, where the observed details of the initial occlusive portion involve (in part) respectively, tongue-fronted articulation, tongue-retracted articulation, and lip-rounded articulation (Greek symbols are used to index the constituents whose phonetic exponents we are considering):
(9) [box diagrams for keep, cart, and coot: in each word the exponents of the nuclear vocalic constituent are shown as temporally co-extensive with the exponents of the onset constituent, so that the initial occlusive portion k is produced with the appropriate fronted, retracted, or lip-rounded articulation respectively]
Notice that the temporal-overlay account (when interpreted parametrically) will result in just the right kind of "vocalic" nucleus characteristics throughout the initial occlusive part of the syllable even though onset and nucleus are not sisters in the phonological structure. Notice, too, that we consider such "coarticulation" as phonological and not simply some phonetic-mechanical effect, since it is easy to demonstrate that it varies from language to language, and, for that matter, within English, from dialect to dialect. A similar interpretation can be given to the onsets of words such as split and sprit, where the initial periods of friction are qualitatively different as are the vocalic portions (this is particularly noticeable in those accents of English which realize the period of friction in spr- words with noticeably lip-rounded tongue-tip retracted palato-alveolarity). In these cases what we want to say is that the liquid constituent dominates the initial cluster so its exponents are coextensive with both the initial friction and occlusion and with the early portion of the exponents of the nucleus. One aspect of this account that I have not made explicit is the tantalizing possibility that only overlaying/ cocatenation need be postulated and that apparently concatenative phenomena are simply a product of different temporal gluings; only permitting one combinatorial operation is a step towards making the model genuinely more restrictive. In this context consider now the typical putative process phenomenon of deletion. As an example consider versions of the words tyrannical and torrential as produced by one (British English) speaker. Conventional accounts (e.g. Gimson 1970; Lass 1985; Dalby 1986; Mohanan 1986) of the tempo/stylistic reduced/elided pronunciations of the first, unstressed syllables of tyrannical and torrential as (10)
(10) [phonetic transcriptions of the reduced first syllables of tyrannical and torrential]
would argue simply that the vowel segment had been deleted. But where does this deletion take place? In the phonology? Or the phonetics? Or both? The notion of phonological features "changing" or being "deleted" is, as I indicated earlier, highly problematical. However, if one observes carefully the phonetic detail of such purportedly "elided" utterances, it is actually difficult to find evidence that things have been deleted. The phonetic detail suggests, rather, that the same material has simply been temporally redistributed (i.e. these things are not phonologically different; they differ merely in terms of their temporal phonetic interpretation). Even a cursory listening reveals that the beginnings of the "elided" forms of tyrannical and torrential do not sound the same. They differ, for instance, in the extent of their lip-rounding, and in terms of their resonances. In the "elided" form of tyr the lip-rounding is coincident with the period of friction, whereas in tor it is observable from the beginning of the closure; tor has markedly back
resonance throughout compared with tyr, which has front of central resonance throughout. (This case is not unusual: compare the "elided" forms of suppose, secure, prepose, propose, and the do ~ dew and cologne ~ clone cases discussed by Kelly and Local 1989: part 4.) A deletion-process account of such material would appear to be a codification of a not particularly attentive observation of the material. By contrast, a cocatenation account of such phenomena obviates the need to postulate destructive rules and allows us to take account of the observed phonetics and argue that the phonological representation and ingredients of elided and nonelided forms are the same; all that is different is the temporal phonetic interpretation of the constituents. The "unreduced" forms have a temporal organization schematically represented as follows:
(11) [box diagrams for unreduced tyr- and tor-: the exponents of the nucleus extend beyond the end of the exponents of the onset]
while the reduced forms may have the exponents of the nucleus of such a duration that their end is coincident with the end of the onset exponents:
(12) [box diagrams for reduced tyr- and tor-: the same exponents, but with the nucleus exponents ending at the same point as the onset exponents]
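The relation between (11) and (12) can be sketched in Python as two temporal interpretations of one and the same set of constituents; the constituent names and millisecond figures below are invented purely for illustration and are not the chapter's actual timing model.

# Sketch: "full" and "reduced" forms share the same phonological ingredients
# and differ only in how the exponents are glued together in time.

def temporal_interpretation(onset_end, nucleus_end, liquid_end, reduced=False):
    """Assign (start, end) spans to each constituent's exponents.

    In the reduced interpretation the nucleus exponents simply end together
    with the onset exponents; no constituent is deleted."""
    if reduced:
        nucleus_end = onset_end
    return {
        "onset exponents": (0, onset_end),
        "nucleus exponents": (0, nucleus_end),      # overlaid, not concatenated
        "liquid exponents": (onset_end, liquid_end),
    }

unreduced = temporal_interpretation(onset_end=90, nucleus_end=160, liquid_end=230)
reduced = temporal_interpretation(onset_end=90, nucleus_end=160, liquid_end=230,
                                  reduced=True)

print("unreduced:", unreduced)
print("reduced:  ", reduced)
# Both interpretations contain the same three exponent sets; only the temporal
# extent of the nucleus exponents differs.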
In the "reduced" forms, an overlaying interpretation reflects exactly the fact that although there may be no apparently "sequenced" "vowel" element the appropriate vocalic resonance is audible during the initial portion. 8.4.2 Assimilation Having provided a basic outline of some of the major features of a "nonsegmental," "nonprocess" approach to phonology, I will now examine the ways such an approach can deal with the phenomenon of "assimilation." Specifically, I want to consider those "assimilations" occurring between the end of one word and the beginning of the next, usually described as involving "alveolar consonants." The standard story about assimilation, so-called, in English can be found in Sweet (1877), Jones (1940), and Gimson (1970), often illustrated with the same examples. Roach (1983:14) provides a recent formulation: For example, thefinalconsonant in "that" daet is alveolar t. In rapid casual speech the t will become p before a bilabial consonant... Before a dental consonant, t changes 204
contexts d would become, b, d, g, respectively, and n would become m, n and r j . . . S becomes, j, and z becomes 3 when followed by J or j . I want to suggest that this, and formulations like it (see below), give a somewhat misleading description of the phenomenon under consideration. The reasons for this are manifold: in part, the problem lies in not drawing a clear distinction between phonology and phonetics; in part, a tacit assumption that there are "phonological segments" and/or "phonetic segments"; in part, not having very much of interest to say in the phonetic domain; in part, perhaps, a lack of thoroughgoing phonetic observation. Consider the following (reasonably arbitrary) selection of quotes and analyses centered on "assimilation": 1
"the underlying consonant is /t/ and that we have rules to the effect: t ->k/ [ + cons — ant — cor]" ' [ + nasal ] -• [ a place ]/
— sonorant — continuant a place
(Mohanan 1986:106).
"The change from /s/ to [s] . . . takes place in the postlexical module in I mi[s] you" (Mohanan 1986:7). ' T h e [mp] sequence derived from np [ten pounds] is phonetically identical to the [mp] sequence derived from mp" (Mohanan 1986:178). "the alveolars . . . are particularly apt to undergo neutralization as redundant oppositions in connected speech" (Gimson 1970:295). "Assimilation of voicing may take place when a word ends in a voiced consonant before a word beginning with a voiceless consonant" (Barry 1984:5). "We need to discover whether or not the phonological processes discernible in fast speech are fundamentally different from those of careful slow speech... as you in rapid speech can be pronounced [9z ja] or [339]" (Lodge 1984:2). d b b b son cnt
— son - cnt
+ ant + cor
+ ant — cor
0
(Nathan, 1988:311) 205
Segment
The flavor of these quotes and analyses should be familiar. I think they are a fair reflection of the generally received approach to handling assimilation phenomena in English. They are remarkably uniform not only in what they say but in the ways they say it. Despite some differences in formulation and in representation, they tell the same old story: when particular things come into contact with each other one of those things changes into, or accommodates in shape to, the other. However, this position does beg a number of questions, for example: what are these "things"? (variously denoted: t ~ k ~ [ + cons, —ant, —cor] [1]; /s/ ~ [s] [3]; [mp] ~ np ~ [mp] ~ mp [4]; alveolars [5]; voiced ~ voiceless consonant [6]. Where are these "things"? - in the phonology or the phonetics or both or somewhere in between? Let us examine some of the assumptions that underlie what I have been referring to as "the standard story about assimilation." The following features represent the commonalities revealed by the quotations above: Concatenative-segmental: the accounts all assume strings of segments at some level of organization (phonological and/or phonetic). Punctual-contrastive: the accounts embody the notion that phonological distinctions and relationships, as realized in their phonetic exponents, are organized in terms of a single, unique cross-parametric, time slice. Procedural-destructive: as construed in these accounts, assimilation involves something which starts out as one thing, is changed/changes and ends up being another. These accounts are typical in being systematically equivocal with respect to the level(s) of description involved: no serious attempt is made to distinguish the phonetic from the phonological. Homophony-producing: these accounts all make the claim (explicitly or implicitly) that assimilated forms engender neutralization of oppositions. That is, in assimilation contexts, we find the merger of a contrast such that forms which in other contexts are distinct become phonetically identical.
8.4.3 Some acoustic and EPG data Although I have presented these commonalities as if they were separate and independent, it is easy to see how they might all be said toflowfrom the basic assumption that the necessary and sufficient form of representation is strings of segments (this clearly holds even for the autosegmental type of representation pictured here). But, as we have seen, there is no need to postulate segments at any level of organization: in the phonology we are dealing with unordered distinctions and oppositions at various places in structure; in the phonetics with parametric exponents of those oppositions. Thus the language used here in these descriptions of assimilation is at odds with what really needs to be said: expressions of the kind "alveolar consonants," 206
8 John Local
"bilabial consonant," "voiced/voiceless consonant" are not labels for phonological contrasts nor are they phonetic descriptors - at the best they are arbitrary, cross-parametric classificatory labels for IPA-type segmental reference categories. There are no "alveolar consonants" or "voiceless consonants" in the phonology of English; by positing such entities we simply embroil ourselves in a procedural-destructive-type account of assimilation because we have to change the "alveolar consonant" into something else (see Repp's discussion of appropriate levels of terminology in speech research [1981]). Nor is it clear that where these descriptions do make reference to the phonetics, the claims therein are always appropriate characterisations of the phenomena. I will consider two illustrative cases where spectrographic and electropalatographic (EPG) records (figures 8.1-8.4) suggest that, at least for some speakers, assimilated forms do not have the same characteristics as other "similar" articulatory/acoustic complexes. There is nothing special about these data. Nor would I wish to claim that the records presented here show the only possible pronunciation of the utterances under consideration. Many other things can and do happen. However, the claims I make and the analytic account I offer are in no way undermined by the different kinds of phonetic material one can find in other versions of the same kinds of utterance. The utterance from which these records was taken was In that case I'll take the black case.4 The data for these speakers is drawn from an extensive collection of connected speech gathered in both experimental and nonexperimental contexts at the University of York. In auditory impressionistic terms, the junction of the words that and case and the junction of black and case exhibit "stretch cohesion" marked by velarity for both speakers (K and L). However, there are also noticeable differences in the quality of the vocalic portion of that as opposed to black. These auditory differences find reflection in the formant values and trajectories in the two words. For K we see an overall higher F p F2, and F3 for that case as opposed to black case and a different timing/trajectory for the F2/F3 "endpoint." For L we see a similar difference in F p F2, and F3 frequencies. That is, the vocalic portion in the syllable closed with the "assimilatory" velarity does not lose its identity (it does not sound like "the same vowel" in the black syllable - it is still different from that in the syllable with the lexical velarity). It is, as it were, just the right kind of vocalic portion to have occurred as an exponent of the nucleus in a syllable where the phonological opposition at its end did not involve velarity. 4
I am indebted to the late Eileen Whitley (unpublished lecture notes, SOAS, and personal communication) for drawing my attention to this sentence. The observations I make derive directly from her important work. In these utterances that and black were produced as accented syllables. 207
Segment
(a) .,
(d) Figure 8.1 Spectrograms of (a) that case and (b) black case produced by speaker K and (c) that case and (d) black case produced by speaker L
208
8 John Local Speaker (L)
40
4!
42
54
55
56
43
44
45
46
47
48
49
50
51
52
47
48
49
50
51
52
53
57
that_case
40
41
54
55
42
43
44
45
4i
53
56
/5/ac/c case
Figure 8.2 Electropalatographic records for the utterances that case and black case produced by speaker L shown in figure 8.1
Examination of the EPG records (for L - seefig.8.2) reveals that for these same two pairs, although there is indeed velarity at the junction of that case and black case, the nature and location of the contact of tongue to back of the roof of the mouth is different. In black case we can see that the tongue contact is restricted to the back two rows of electrodes on the palate. By contrast, in that case the contact extends forward to include the third from back row (frames 43-5, 49-52) and for three frames (46-8) the fourth from back row; there is generally more contact, too, on the second from back row (frames 45-9) through the holding of the closure. Put in general terms, there is an overall fronter articulation for the junction of that case as compared with that for black case. Such auditory, spectrographic, and EPGfindingsare 209
Segment
routine and typical for a range of speakers producing this and similar utterances5 (though the precise frontness/backness relations differ [see Kelly and Local 1986]). Consider now the spectrograms in figure 8.3 of three speakers (A, L, W) producing the continuous utterance This shop's a fish shop.6 The portions of interest here are the periods of friction at the junction of this shop and fish shop. A routine claim found in assimilation studies is that presented earlier from Roach: "s becomes J, and z becomes 3 when followed by J or j . " Auditory impressionistic observation of these three speakers' utterances reveals that though there are indeed similarities between the friction at the junction of the word pairs, and though in the versions of this shop produced by these speakers it does not sound like canonical apico-alveolar friction, the portions are not identical and a reasonably careful phonetician would feel uneasy about employing the symbol J for the observed "palatality" in the this shop case. The spectrograms again reflect these observed differences. In each case the overall center of gravity for the period of friction in this shop is higher than that in fish shop (see Shattuck-Hufnagel, Zue, and Bernstein 1978; and Zue and Shattuck-Hufnagel 1980). These three speakers could also be seen to be doing different things at the junction of the word pairs in terms of the synchrony of local maxima of lip rounding relative to other articulatory components. For this shop the onset of lip rounding for all three speakers begins later in the period of friction than in fish shop, and it gets progressively closer through to the beginning of shop. In fish shop for these speakers lip rounding is noticeably present throughout the period of final friction in shop. (For a number of speakers that we have observed, the lip-rounding details discussed here are highlighted by the noticeable lack of lip rounding in assimilated forms of this year.) Notice too that, as with the that case and black case examples discussed earlier, there are observable differences in the Fx and F3 frequencies in the vocalic portions of this and shop. Most obviously, in this Fl is lower than in fish and F3 is higher (these differences show consistency over a number of productions of the same utterance by any given speaker). The vocalic portion in this even with "assimilated" palatality is not that of a syllable where the palatality is the exponent of a different (lexically relevant) systemic opposition. While the EPG records for one of these speakers, L (seefig.8.4), offer us no insight into the articulatory characteristics corresponding to the impressionistic and acoustic differences of the vocalic portions they do show 5
6
Kelly and Local (1989: part 4.7) discuss a range of phenomena where different forms are said to be homophonous and neutralization of contrast is said to take place. They show that appropriate attention to parametric phonetic detail forces a retraction of this view. See also Dinnsen (1983). In these utterances 'this' and 'fish' were produced as accented syllables. 210
8 John Local
iutiitiSif
ran
I Figure 8.3 Spectrograms of the utterances This shop's a fish shop produced by speakers W, A, and L
211
Segment 40 '"
41 !!!!!!
69
42 !!";;
70
A3 !!!!!!
71
44 !!!!!!
72
45 D"'!"
73
4i
47
.,
48,
„
4J
O!!!!'fl 00
74
75
76
this shop ('assimilated')
! I I lo! 82
6?
70
71
83
84
85
73
74
75
76
77
fish shop Figure 8.4 Electropalatographic records of the utterances this shop and fish shop produced by speaker L shown in figure 8.3
that the tongue-palate relationships are different in the two cases. The palatographic record of this shop shows that while there is tongue contact around the sides of the palate front-to-back throughout the period corresponding to the friction portion, the bulk of the tongue-palate contact is oriented towards the front of the palate. Compare this with the equivalent period taken fromfishshop. The patterning of tongue-palate contacts here is very different. Predominantly, we find that the main part of the contact is 212
8 John Local
oriented towards the back of the palate, with no contact occurring in the first three rows. This is in sharp contrast to the EPG record for this shop where the bulk of the frames corresponding to the period of friction show tonguepalate contact which includes the front three rows. Moreover, where this shop exhibits back contact it is different in kind from that seen for fish shop in terms of the overall area involved.7 The difference here reflects the difference in phonological status between the ("assimilatory") palatality at the end of this, an exponent of a particular kind of word juncture, and that of fish, which is an exponent of lexically relevant palatality. What are we to make of these facts? First, they raise questions about the appropriacy of the received phonetic descriptions of assimilation. Second, they lead us to question the claims about "neutralization" of phonological oppositions. (Where is this neutralization to be found? Why should vocalic portions as exponents of "the same vowel" be different if the closures they cooccur with were exponents of "the same consonant"?) Third, even if we were to subscribe to the concatenative-segment view, these phonetic differences suggest that whatever is going on in assimilation, it is not simply the replacement of one object by another. 8.5 A nonprocedural interpretation
Can we then give a more appropriate formulation of what is entailed by the notion "assimilation"? Clearly, there is some phenomenon to account for, and the quotations I cited earlier do have some kind of a ring of common sense about them. The least theory-committed thing we can say is that in the "assimilated" and "nonassimilated" forms (of the same lexical item) we have context-dependent exponents of the same phonological opposition (if it were not the same then presumably we would be dealing with a different lexical item). Notice, too, that this description fits neatly "consonant-vowel" coarticulation phenomena described earlier. And as with "coarticulation," we can see that, at least in the cases under consideration, we are dealing here not simply with some kind of context-dependent adjacency relationship defined over strings, but rather with a structare-dependent relationship; "assimilation" involves some constraint holding over codas and onsets, typically where these constituents are themselves constituents of a larger structure (see Cooper and Paccia-Cooper 1980; Scott and Cutler 1984; Local and Kelly 1986). Notice that the phonetic details of the "assimilated" consonant are specific to its assimilation environment in terms of the precise place of articulation, closure, and duration characteristics, and its relations with the exponents of the syllable nucleus. As I have suggested, the precise 7
Similar EPG results for "assimilations" have been presented by Barry (1985), Kerswill and Wright (1989), and Nolan (this volume). 213
Segment
cluster of phonetic features which characterize "assimilated consonants" do not, in fact, appear to be found elsewhere. And just as with coarticulation, we can give assimilation a nonprocedural interpretation. In order to recast the canonical procedural account of assimilation in rather more "theoryneutral" nonprocedural fashion we need minimally: to distinguish phonology from phonetics - we do not want to talk of the phonology changing. Rather we want to be able to talk about different structure-dependent phonetic interpretations of particular coda-onset relations; a way of representing and accessing constituent-structure information about domains over which contrasts operate: in assimilation, rime exponents are maintained; what differences there are are to be related to the phonetic interpretation of coda; a way of interpreting parameters (in temporal and quality terms) for particular feature values and constellations of feature values. The approach to phonology and phonetics which I described earlier (section 8.3) has enabled us at York to produce a computer program which embodies all these prerequisites. It generates the acoustic parameters of English monosyllables (and some two-syllable structures) from structured phonological representations. The system is implemented in Prolog - a programming language particularly suitable for handling relational problems. This allows successful integration of, for example, feature structures, operations on graphs, and unification with specification of temporal relations obtaining between the exponents of structural units. In the terms sketched earlier, the program formally and explicitly represents a phonology which is nonsegmental, abstract, structured, monostratal, and monotonic. Each statement in the program reflects a commitment to a particular theoretical position. Socalled "process" phenomena such as "consonant-vowel" coarticulation are modeled in exactly the way described earlier; exponents of onset constituents are overlaid on exponents of syllable-rime-nucleus constituents. In order to do this it is necessary to have a formal and explicit representation of phonological relations and their parametric phonetic exponents. Having such parametric phonetic representations means that (apart from not operating with cross-parametric segmentation) it becomes possible to achieve precise control over the interpretation of different feature structures and phonetic generalizations across feature structures. As yet, I have had nothing to say about how we cope with the destructiveprocess (feature-changing) orientation of the accounts I have considered so far. I will deal with the "process" aspect first. A moment's consideration reveals a very simple declarative, rather than procedural, solution. What we want to say is something very akin to the overlaying in coarticulation discussed above. Just as the onset and rime are deemed to share particular 214
8 John Local
vocalic characteristics so, in order to be able to do the appropriate phonetic interpretation of assimilation, we want to say that the coda and onset share particular features. Schematically we can picture the "sharing" in an "assimilation context" such as that case as follows: (13) a, exponents coda, exponents
a 2 exponents onset2 exponents
velarity exponents
We can ensure such a sharing by expressing the necessary constraint equation over the coda-onset constituents, e.g.: (14)
syllable
syllable.. rime
onset coda,
a, exponents
a2 exponents coda, = onset, velarity exponents
While this goes some way to removing the process orientation by stipulating the appropriate structural domain over which velarity exponents operate, there is still the problem of specifying the appropriate feature structures. How can we achieve this sharing without destructively changing the feature structure associated with codaj? At first glance, this looks like a problem, for change the structure we must if we are going to share values. Pairs of feature structures cannot be unified if they have conflicting information - this would certainly be the case if the codat has associated with it a feature structure with information about (alveolar) place of articulation. However, if we consider the phonological oppositions operating in system at this point in structure we can see that alveolarity is not relevantly statable. If we examine the phonetic 215
Segment
exponents of the coda in various pronunciations of that, for example, we can see that the things which can happen for "place of articulation" are very different from the things which can occur as exponents of the onset in words such as tot, dot, and not. In the onsets of tof-type words we do not find variant, related pronunciations with glottality (e.g. ?h ?-) or labiality (e.g. p~: ?w~) or velarity (e.g. k~ ?k- :).
The reason for this is that at this (onset) plaee in structure alveolarity versus bilabiality versus velarity are systemically in opposition. The obvious way out of the apparent dilemma raised by assimilation, then, is to treat the (alveolar) place of articulation as not playing any systemically distinctive role in the phonology at this (coda) place in structure. This is in fact what Lodge (1984) proposes, although he casts his formulation within a process model. He writes: "In the case of the alveolars, so-called, the place of articulation varies considerably, as we have already seen. We can reflect this fact by leaving /t/, /d/ and /n/ unspecified for place and having process and realization rules supply the appropriate feature" (1984: 123). And he gives the following rewrite-rule formulation under the section headed "The process rules": 1 If
[stop] [ 0place ]
then [ aplace ]/ 2 If
[ aplace ]
[stop] [ 0place ]
then [ alv ]
Although Lodge (1984: 131) talks of "realization rules" (rule 2 here), the formulation here, like the earlier formulations, does not distinguish between the phonetic and the phonological (it again trades on a putative, naive interpretation of the phonological features). "Realization" is not being employed here in a "phonetic-interpretation" sense; rather, it is employed in the sense of "feature-specification defaults" (see Chomsky and Halle 1968; Gazdar et al 1985). The idea of not specifying some parts of phonological representations has recently found renewed favor under the label of "underspecification theory" (UT) (Archangeli 1988). However, the monolithic asystemic approach sub216
8 John Local
sumed under the UT label is at present too coarse an instrument to handle the phenomena under consideration here (because of its across-the-board principles, treating the codas of words like bit, bid and bin as not being specified for "place of articulation" would entail that all such "alveolars" be so specified; as I will show, though, the same principle of not specifying some part of the phonological representation of codas of words such as this and his is appropriate to account for palatality assimilation as described earlier, it is not "place" of articulation that is involved in these cases). Within the phonological model described earlier, we can give "un(der)specification of place" a straightforward interpretation. One important characteristic of unification-based formalisms employing feature structures of the kind I described earlier (and which are implemented in our model) is their ability to share structure. That is, two or more features in a structure may share one value (Shieber 1986; Pollard and Sag 1987). When two features share one value, then any increment in information about that value automatically provides further specification for both features. In the present case, what we require is that coda-constituent feature structures for bit, bid, and bin, for instance, should have "place" features undefined. A feature whose value is undefined is denoted: [feat: 0]. Thus we state, in part, for coda exponents under consideration: (15) grv: 0 cmp: 0 This, taken with the coda-onset constraint equation above, allows for the appropriate sharing of information. We can illustrate this by means of a partial feature structure corresponding to // cut: (16) src: coda:
cons: |_J_J voc:
src: onset:
cons: 11 1fgrv:
+
cmp:
4-
voc: 277
Segment
The sharing of structure is indicated by using integer coindexing in featurevaiue notation. In this feature structure CD indexes the category [grv: -f, cmp: +], and its occurrence in [coda; [cons: Q]]] indicates that the same value is here shared. Note that this coindexing indicating sharing of category values at different places in structure is not equivalent to multiple occurrences of the same values at different places in structure. Compare the following feature structure: (17) src: coda:
cons:
grv:
voc:
src: onset:
cons:
grv:
voc:
This is not the same as the preceding feature structure. In the first case we have one shared token of the same category [cons: [grv: +, cmp: +]], whereas in the second there are two different tokens. While the first of these partial descriptions corresponds to the coda-onset aspects of it cut, the second relates to the coda-onset aspects of an utterance such as black case. By sharing, then, we mean that the attributes have the same token as their value rather than two tokens of the same type. This crucial difference allows us to give an appropriate phonetic interpretation in the two different cases (using the extant exponency data in the synthesis program). As the standard assimilation story reflects, the phonological coda oppositions which have, in various places, t, d, n as their exponents share certain commonalities: their exponents can exhibit other than alveolarity as place of articulation; in traditional phonetic terms their exponents across different assimilation contexts have manner, voicing, and source features in common. 218
8 John Local
Thus the exponents of the coda of ten as in ten peas, ten teas, ten keys all have as part of their make-up voice, closure, and nasality; the exponents of the coda of it in it tore, it bit, it cut all have voicelessness, closure, and nonnasality as part of theirs. In addition, the "coarticulation" of the assimilated coda exponents binds them to the exponents of the nucleus constituent in the same syllable. At first glance, the situation with codas of words such as this and his might appear to be the same as that just described. That is, we require underspecification of "place of articulation." In conventional accounts (e.g. Roach above and Gimson 1970) they are lumped for all practical purposes with the "alveolars" as an "assimilation class." In some respects they are similar, but in one important respect the exponents of these codas behave differently. Crucially, we do not find the same place of articulation phenomena here that wefindwith the bit, bid, and bin codas. Whereas in those cases we find variants with alveolarity, labiality, velarity, and dentality, for example, we do not find the same possibilities for this and his type of codas. For instance, in our collection of material, we have not observed (to give broad systemic representations) forms such as: (18)
6iffif difman digdsn diOOir)
for this fish, this man, this then, and this thing. However, we do find "assimilated" versions of words such as this and his with palatality (as described above, e.g. this shop, this year). In these cases we do not want to say that the values of the [ens] part of the coda feature-structure are undefined in their entirety. Rather, we want to say that we have sharing of "palatality." (Although the production of palatality as an exponent of the assimilated coda constituent of this may appear to be a "place-of-articulation" matter, it is not [see Lass 1976: 4.3, for a related, though segmental, analysis of alveolarity with friction, on the one hand, and palato-alveolarity with friction, on the other]). The allocation of one kind of "underspecification" to the codas of bit, bid, and bin forms and another to the codas of this and his forms represents part of an explicit implementation of the Firthian polysystemic approach to phonological interpretation. Given the way we have constructed our feature descriptions, the partial representation for the relevant aspects of such feature structures will look like this: (19) grv:
grv:
cons:
cons: emp:
emp:
219
Segment
U here denotes the unification of the two structures. This will give us the large structure corresponding to the "assimilated" version of this shop: (20)
coda:
grv: cons:
cmp:
voc:
onset:
grv:
+
cmp:
+
Once we have done this, it is possible to give again an appropriate kind of phonetic interpretation such that the synthesized tokens of this shop and fish shop have appropriate kinds of friction portion associated with them (for a given particular version a particular kind of "assimilation" piece). Because we have a componential parametric phonetic representation, sharing of "compactness" across the exponents of structure, as described here, is a trivial task. Compare the spectrograms of the synthesis output below of tokens of this with fish, and this shop ("unassimilated" and "assimilated") with this and with fish shop. (The unassimilated version of this shop was generated by not allowing the sharing of compactness across coda-onset constituents in the description of the structural input to the synthesis program.) A number of points are worth comment. First, the periods of friction corresponding to the "assimilated" coda in this and the coda offish are not identical, nor are the portions corresponding to the vocalic part of these syllables. By contrast there are clear similarities between the portions corresponding to the vocalic part of the "unassimilated" and "assimilated" versions of this and the isolated form of this. Compare these spectrograms with the natural-speech versions in figure 8.3 above. The synthesis output, 220
8 John Local
Figure S.5(a) and (b). For caption see page 223. 221
Segment
hi M l A Milt
Figure 8.5(c) and (d). For caption see facing page. 222
8 John Local
Figure 8.5 Spectrograms of synthetic versions of (a) this, (b) fish, (c) this shop (unassimilated), (d) this shop (assimilated), and (e)fish shop
while differing in many details from the natural versions, nonetheless has in it just the same kinds of relevant similarities and differences. 8.6 Conclusion
I have briefly sketched an approach to phonological representation and parametric phonetic interpretation which provides an explicit and computationally tractable model for nonsegmental, nonprocedural, (rewrite-)rulefree speech synthesis. I have tried to demonstrate that it is possible to model "process" phenomena within this approach in a nonderivational way that provides a consistent and coherent account of particular assimilatory aspects of speech. The success of this approach in high-quality speech synthesis suggests that this model of phonological organization will repay extensive further examination.
223
Segment
Comments on chapter 8 KLAUS KOHLER Summary of Local's position
The main points of Local's paper are: 1 His phonological approach does not admit of segmental entities at any level of representation. 2 It distinguishes strictly between phonology and phonetics. Phonology is abstract. It is monostratal, i.e. there is only one level of phonological representation. There are no derivational steps, therefore no processes (e.g. the conversion of a high tone into a rising tone following a low tone). Instead, phonological categories have phonetic exponents, which are descriptions of physical, temporal events formulated in the physical domain, i.e. phonetic representations in terms of component parameters and their synchronization in time. The precise form of phonetic representations has no bearing on the form of phonological representations. Phonological representations are structured; there is no time or sequence in them, but only places in structure. Phonology deals with unordered labeled graph-objects instead of linearly ordered strings of symbols. Talking about sequence only makes sense in terms of temporal phonetic interpretation. Feature structures must be distinguished from parametric phonetic representations in time. Deletion and insertion are treated as different kinds of parameter synchronization in the phonetic interpretation of the phonological representation. Phonological operations are monotonic; i.e. the combinatorial operations with which phonological structures are constructed can only add information to representations and cannot remove anything. 3 As a corollary to the preceding, there are no rewrite rules. The assimilations of, for example, alveolars to following labials and velars at word boundaries in English are not treated as changes in phonology, because there are no "alveolar consonants" in the phonology of English; "by positing such entities we simply embroil ourselves in a procedural-destructive-type account of assimilation because we have to change the 'alveolar consonant' into something else" (p. 207). Quite apart from this there is no homophony in, e.g., that case and black case or this shop andfishshop, so the empirical facts are not reported correctly.
224
8 Comments Critical comments
My reply to Local's paper follows the line of arguments set out above. 1 If phonological elements are nonsegmental and unordered, only structured, and if phonetic exponency shows ordering in time, then Local has to demonstrate how the one is transformed into the other. He has not done this. To say that component parameters are synchronized in time in phonetic representations is not enough, because we require explicit statements as to the points in sequence where this synchronization occurs, where certain parameters take on certain values. Local does not comprehensively state how the generation of acoustic parameters from structured phonological representations is achieved. What is, for instance the input into the computer that activates these exponency files, e.g. in black case and that easel Surely Local types in the sequence of alphabetic symbols of English spelling, which at the same time reflects a phonetic order: k and t come before c of case. This orthographic sequence is then, presumably, transformed into Local's phonological structures, which, therefore, implicitly contain segmental-order information because the structures are derived from a sequential input. So the sequential information is never lost, and, consequently, not introduced specially by the exponency files activated in turn by the structural transforms. When the exponency files are called upon to provide parametric values and time extensions the segmental order is already there. Even if Local denies the existence of segments and sequence in phonology, his application to synthesis by rule in his computer program must implicitly rely on it. 2 If phonology is strictly separated from phonetics, and abstract, how can timeless, unordered, abstract structures be turned into ordered, concrete time courses of physical parameters? Features must bear very direct relations to parameters, at least in a number of cases at focal points, and there must be order in the phonological representations to indicate whether parameter values occur sooner or later, before or after particular values in some other parameter. Moreover, action theory (e.g. Fowler 1980) has demonstrated very convincingly that temporal information must be incorporated in phonological representations. This is also the assumption Browman and Goldstein (1986, this volume) work on. And many papers in this volume (like Firth himself, e.g. Firth 1957) want to take phonology into the laboratory and incorporate phonetics into phonology. The precise form of phonetic representations does indeed have a bearing on the form of phonological representations (cf. Ohala 1983). For example, the question as to why alveolars are assimilated to following labials and velars, not vice versa, and why labials and velars are not assimilated to each other in English or 225
Segment German, finds its answer in the phonetics of speech production and perception, and the statement of this restriction of assimilation must be part of phonology (Kohler 1990). If the strict separation of phonetics and phonology, advocated by Local, is given up, phonology cannot be monostratal, and representations have to be changed. It is certainly not an empirical fact that information can only be added to phonological representations, never removed. The examples that case/black case and this shop/fish shop do not prove the generality of Local's assertion. We can ask a number of questions with regard to them: 1 How general is the distinction? The fact that it can be maintained is no proof that is has to be maintained. The anecdotal reference to these phrases is not sufficient. We want statistical evaluations of a large data base, not a few laboratory productions in constructed sentences by L and K, which may even stand for John Local and John Kelly. 2 Even if we grant that the distinction is maintained in stressed that case vs. stressed black case, what happens in It isn't in the bag, it's in that easel 3 What happens in the case of nasals, e.g. in You can get it or pancake? 4 What happens in other vowel contexts, e.g. in hot cooking? 5 In German mitkommen, mitgehen the traces of /t/ are definitely removable, resulting in a coalescence with the clusters in zuruckkommen, zuruckgehen; similarly, ankommen/langkommen, angehen/langgehen. 6 Even if these assimilations were such that there are always phonetic traces of the unassimilated structures left, this cannot be upheld in all cases of articulatory adjustments and reductions. For instance, German mit dem Auto can be reduced to a greater or lesser extent in the function words mit and dem. Two realizations at different ends of the reduction scale are: [mit de-m '?aoto:] [mim 'Paotoi] There is no sense in saying that the syllable onset and nucleus of [ deTm] are still contained in [m], because the second utterance has one syllable less. If, however, the two utterances are related to the same lexical items and a uniform phonological representation, which is certainly a sensible thing to do, then there has to be derivation and change. And this derivation has to explain phonologically and phonetically why it, rather than any other derivation, occurs. Allowing derivations can give these insights, e.g. for the set of German weak form reductions [mit deTm], (mit ctam], [mitm], [mipm], [mibm], [mimm], [mim] along a scale from least reduced and most formal to most reduced and least formal, which can be accounted for by a set of ordered rules explaining these changes with reference to general phonetic principles, and excluding all others (Kohler 1979). 3 Rewrite rules are thus not only inevitable, they also add to the explanatory power of our phonological descriptions. This leads me to a further question. 226
8 Comments
What are phonologies of the type Local proposes useful for? He does not deal with this issue, but I think it is fair to say that he is not basically concerned with explanatory adequacy, i.e. with the question as to why things are the way they are. His main concern is with descriptive adequacy, and it is in this respect that the acuteness of phonetic observations on prosodic lines can contribute a lot, and has definitely done so. This is the area where prosodic analysis should continue to excel, at the expense of the theorizing presented by Local.
Comments on chapter 8 MARIO ROSSI The phonology-phonetics distinction
I agree with Local on the necessity of a clear distinction between the two levels of matter (substance) and form. But it seems to me that his conception is an old one derived from a misinterpretation of De Saussure. Many of the structuralists thought that the concept of language in De Saussure was defined as a pure form; but an accurate reading of De Saussure, whose ideas were mostly influenced by Aristotle's Metaphysics, shows that the "langue" concept is a compound of matter and form, and that the matter is organized by the form. So it is the organization of acoustic cues in the matter, which is a reflection of the form, that allows in some way the perception and decoding of the linguistic form. Consequently, Local's assumption, "it makes no sense to talk of structure or systems in phonetics" (p. 193), is a misconception of the relationship between matter and form. Matter is not an "amorphous" substance. The arbitrary relationship between matter and form (I prefer "matter and form" to "phonetics and phonology") means that the parameters of the matter and the way in which they are organized are not necessarily linked with the form; that is, the form imposes an organization on the parameters of the matter according to the constraints and the specific modes of the matter. So we are justified in looking for traces of form values in the matter. At the same time, we have to remember that the organization of matter is not isomorphic with the structure of the form. In other words, the acoustic/ articulatory cues are not structured as linguistic features, as implied in Jakobson, Fant, and Halle (1952). In that sense, Local is right when he says "phonological features do not have implicit denotations" (p. 196). In reality, the type of reasoning Local uses in the discussion of the assimilation process in this shop and fish shop demonstrates that he is looking at the acoustic 227
Segment parameters as organized parameters, parameters that reflect the formal structure. I agree in part with the assumption that the search for invariant phonetic correlates of phonological categories is implied by the erroneous belief that the phonological categories have some kind of implicit "naive phonetic denotation." However, I think that the search for invariants by some phoneticians is not "naive," but more complexly tied to two factors: 1 The structuralist conception of language as pure form, and the phonology derived from this conception. In this conception a phonological unit does not change: "once a phoneme always a phoneme." 2 The lack of a clear distinction in American structuralism between cues and features, derived from the misconception that emphasizes the omnipotence of the form embedded in the matter. Finally, to say that "the relationship between phonology and phonetics is arbitrary" (p. 193) is to overlook the theory of natural phonology. The concept of the arbitrary relationship needs to be explained and clarified. Consider Hooper's (1976) "likelihood condition"; if this condition did not hold, the phonological unit would have phonetic exponents that would be totally random. Underspecification How can an "unspecified" coda affect the onset exponents of the structure (p. 215). Local defines some codas as "underspecified"; he says that phonological descriptions are algebraic objects and the precise form of phonetic representation has no bearing on the form of the phonological representations. I see a contradiction in this reasoning: underspecification is posited in order to account for phonetic assimilation processes. So phonetic representations have bearing on form! I see no difference between underspecification and the neutralization that Local wants to avoid.
228
9 Lexical processing and phonological representation ADITI LAHIRIand WILLIAM MARSLEN-WILSON
9.1 Introduction
In this paper, we are concerned with the mental representation of lexical items and the way in which the acoustic signal is mapped onto these representations during the process of recognition. We propose here a psycholinguistic model of these processes, integrating a theory of processing with a theory of representation. We take the cohort model of spoken wordrecognition (Marslen-Wilson 1984, 1987) as the basis for our assumptions about the processing environment in which lexical processing takes place, and we take fundamental phonological assumptions about abstractness as the basis for our theory of representation. Specifically, we assume that the abstract properties that phonological theory assigns to underlying representations of lexical form correspond, in some significant way, to the listener's mental representations of lexical form in the "recognition lexicon," and that these representations have direct consequences for the way in which the listener interprets the incoming acoustic-phonetic information, as the speech signal is mapped into the lexicon. The paper is organized as follows. We first lay out our basic assumptions about the processing and representation of lexical form. We then turn to two experimental investigations of the resulting psycholinguistic model: the first involves the representation and the spreading of a melodic feature (the feature [nasal]); and the second concerns the representation of quantity, specifically geminate consonants. In each case we show that the listener's performance is best understood in terms of very abstract perceptual representations, rather than representations which simply reflect the surface forms of words in the language.
229
Segment 9.2 Outline of a theory of lexical processing and representation
In the following two sections we will outline our basic assumptions about the abstract nature of mental representations in the recognition lexicon, and about the processing environment within which these representations have their perceptual function. 9.2.1 Assumptions about representation
The psycholinguistically relevant representations of lexical form must be abstract in nature - that is, they must in some way abstract away from the variabilities in the surface realization of lexical form. This means that our account of the properties of the recognition lexicon must be an account in terms of some set of underlying mental representations. The only systematic hypotheses about the properties of these underlying representations are those that derive from phonological theory, and it is these that we take as our starting point here. We do not assume that there is a literal and direct translation from phonological analysis to claims about mental representation, but we do assume that there is a functional isomorphism between the general properties of these mental representations and the general properties of lexical-form representations as established by phonological analysis. From the consensus of current opinion about phonological representations we extract three main assumptions. First, we assume hierarchical representation of features, such that underlying segments are not defined as unordered lists. Such a featural organization attributes features to different tiers and allows the relationships between two features or sets of features to be expressed in terms of links between the tiers (Clements 1985). An advantage of hierarchical organization of features is that it is possible to express the fact that certain groups of features consistently behave as functional units. From the perspective of lexical access and the recognition lexicon, this means that the input to the lexicon will need to be in terms of features rather than segments, since there is no independent level of representation corresponding to segments which could function as the access route to the lexicon. The second assumption concerns the representation of quantity. The autosegmental theory of length has assumed that the feature content of the segment is represented on a different level (the melody tier) from the quantity of the segment. Long segments, both vowels and consonants, are single melody units containing all featural information doubly linked to two abstract timing units on the skeletal tier (see Hayes 1986). Researchers differ as to the exact nature of the unit of representation (see McCarthy and Prince 1986; Lahiri and Koreman 1988; Hayes 1989), but for our purposes, it is sufficient to assume that in the representation of lexical items in the mental 230
9 Aditi Lahiri and William Marslen- Wilson
lexicon, the featural content is separate from its quantity. This has implications for the perceptual processing of quantity as opposed to quality information, as we discuss in more detail in section 9.4 below. The third assumption concerns the amount of featural content present in the underlying representation. A theory of feature specification must determine what features are present underlyingly and which values of these features are specified. We assume here an underspecified lexicon, where only unpredictable information is represented. First, only distinctive features which crucially differentiate at least two segments are present, and second, only one value of a feature (the marked value) is specified. These claims are by no means uncontroversial. However, a full discussion of the different views on underspecification is beyond the scope of this paper. Since we are primarily concerned with the feature [nasal] in the research we present here, we will discuss later in the paper the representation of this feature in the relevant languages. From the perspective of the processing model and the recognition lexicon, these assumptions mean that the lexical representations deployed in speech recognition will contain only distinctive and marked information. The crucial consequence of this, which we will explore in the research described below, is that the process of lexical access will assign a different status to information about nondistinctive, unmarked properties of the signal than it will to information that is directly represented in the recognition lexicon. This means that neutralized elements (at least, those that arise from postlexical feature-spreading rules) are interpreted with respect to their underlying representation and not their surface forms. 9.2.2 Assumptions about processing
To be able to evaluate claims about representation in a psycholinguistic framework, they need to be interpreted in the context of a model of the processing environment for lexical access. The theory that we assume here is the cohort model of spoken-word recognition (Marslen-Wilson 1984, 1987). The salient features of this model are as follows. The cohort model distinguishes an initial, autonomous process of lexical access and selection, responsible for the mapping of the speech signal onto the representations of word forms in the mental lexicon. The model assumes that there is a discrete, computationally independent recognition element for each lexical unit. This unit represents the functional coordination of the bundle of phonological, morphological, syntactic, and semantic properties defining a given lexical entry. Here we are concerned only with the phonological aspects of the representation. The "recognition lexicon," therefore, is constituted by the entire array of such elements. 231
Segment
A second property of the system is that it allows for the simultaneous, parallel activation of each lexical element by the appropriate input from the analysis of the acoustic signal. This is coupled with the further assumption that the level of activation of each element reflects the goodness of fit of the input to the form specifications for each element. As more matching input accumulates, the level of activation will increase. When the input pattern fails to match, the level of activation starts to decay. These assumptions lead to the characteristic cohort view of the form-based access and selection process. The process begins with the multiple access of word candidates as the beginning of the word is heard. All of the words in the listener's mental lexicon that share this onset sequence are assumed to be activated. This initial pool of active word candidates forms the "word-initial cohort," from among which the correct candidate will subsequently be selected. The selection decision itself is based on a process of successive reduction of the active membership of the cohort of competitors. As more of the word is heard, the accumulating input pattern will diverge from the form specifications of an increasingly higher proportion of the cohort membership. This process of reduction continues until only one candidate remains still matching the speech input - in activation terms, until the level of activation of one recognition element becomes sufficiently distinct from the level of activation of its competitors. At this point the form-based selection process is complete, and the word form that best matches the speech input can be identified. For our current concerns, the most important feature of this processing model is that it is based on the concept of competition among alternative word candidates. Perceptual choice, in the cohort approach, is a contingent choice. The identification of any given word does not depend simply on the information that this word is present. It also depends on the information that other words are not present, since it is only at the point in the word where no other words fit the sensory input - known as the "recognition point" - that the unique candidate emerges from among its competitors. The recognition of a word does not depend on the perceptual availability of a complete specification of that word in the sensory input, either where individual segments or features are concerned, or where the whole word is concerned. The information has to be sufficient to discriminate the word from its competitors, but this is a relative concept. This makes the basic mode of operation of the recognition system compatible with the claim that the representations in the recognition lexicon contain only distinctive information. This will be sufficient, in a contingent, competition-based recognition process, to enable the correct item to be recognized. The second important aspect of a cohort approach to form-based process232
9 Aditi Lahiri and William Marslen- Wilson
ing is its emphasis on the continuous and sequential nature of the access and selection process. The speech signal is based on a continuous sequence of articulatory gestures, which result in a continuous modulation of the signal. Recent research (Warren and Marslen-Wilson 1987, 1988) shows that this continuous modulation of the speech signal is faithfully tracked by the processes responsible for lexical access and selection. As information becomes available in the signal, its consequences are immediately felt at the lexical level. This means that there is no segment-by-segment structuring of the relationship between the prelexical analysis of the speech signal and the interpretation of this analysis at the level of lexical choice. The system does not wait until segmental labels can be assigned before communicating with the lexicon. Featural cues start to affect lexical choice as soon as they become available in the speech input. The listener uses featural information to select words that are compatible with these cues, even though the final segment cannot yet be uniquely identified (Warren and Marslen-Wilson 1987, 1988). On a number of counts, then, the cohort view of lexical access is compatible with the phonological view of lexical representation that we outlined earlier. It allows for an on-line process of competition between minimally specified elements, where this minimal specification is still sufficient to maintain distinctiveness; and second, it allows for this competition to be conducted, with maximal on-line efficiency, in terms of a continuous stream of information about the cues that the speech signal provides to lexical identity, where these cues are defined in featural terms.1 Given this preliminary sketch of our claims about lexical representation in the context of a model of lexical processing, we now turn to a series of experimental investigations of the psycholinguistic model that has emerged. 9.3 Processing and representation of a melodic feature
Our fundamental claim is that the processes of lexical access and selection are conducted with respect to abstract phonological representations of lexical form. This means that listeners do not have available to them, as they process the speech input, a representation of the surface phonetic realization of a given word form. Instead, what determines their performance is the underlying mental representation with respect to which this surface string is being interpreted. We will test this claim here in two ways, in each case asking whether information in the speech signal is interpreted in ways that follow from the 1
The Trace model (McClelland and Elman 1986) is an example of a computationally implemented model with essentially these processing properties - although, of course, it makes very different assumptions about representation. 233
Segment
claims that we have made about the properties of the underlying representations. In the first of these tests, to which we now turn, we investigate the interpretation of the same surface feature as its underlying phonological status varies across different languages. If it is the underlying representation that controls performance, then the interpretation of the surface feature should change as its underlying phonological status changes. In the second test, presented in section 9.4, we investigate the processing and representation of quantity. 9.3.1 The orallnasal contrast in English and Bengali
The feature that we chose to concentrate on was the oral/nasal contrast for vowels. This was largely because of the uncontroversial status of the feature [nasal] in natural languages. Nasal vowels usually occur in languages which have the oral counterpart, and are thus considered to be marked. In terms of underlying feature specification, distinctive nasal vowels are assumed to be marked underlyingly as [ + nasal], whereas the oral vowels are left unspecified (e.g. Archangeli 1984). The second reason for choosing the feature [nasal] was because the presence of vowel nasalization does not necessarily depend on the existence of an underlyingly nasal vowel. A phonetically nasal vowel can reflect an underlying nasal vowel, or it can be derived, by a process of assimilation, from a following nasal consonant. Such regressive nasal-assimilation processes are widespread cross-linguistically. This derived nasalization gives us the basic contrasts that we need, where the same surface feature can have a varying phonological status, contrasting both within a given language and between different languages. This leads to the third reason for choosing the feature [nasal]: namely, the availability of two languages (English and Bengali) which allowed us to realize these contrasts in the appropriate stimulus sets. The relevant facts are summarized in table 9.1. English has only underlying oral vowel segments. In the two cases illustrated (CVC and CVN) the vowel is underlyingly oral. An allophonic rule of vowel nasalization nasalizes all oral vowels when followed by a nasal consonant, giving surface contrasts like ban [baen] and bad [baed]. In autosegmental terms, assimilation is described as spreading where an association line is added. This is illustrated in the differing surface realization of the oral vowels in the CVNs and the CVCs in table 9.1. The assumption that the vowel in the CVN is underlyingly oral follows from the fact that surface nasalization is always predictable, and therefore need not be specified in the abstract representation. Bengali has both underlyingly oral and nasal vowel segments. Each of the 234
9 Aditi Lahiri and William Marslen- Wilson Table 9.1 Underlying and surface representation in Bengali and English Bengali CVN V
Underlying
C
V
cvc
C
CVN V C \\ \ [ -Knas ]
V
CVN V
C
cvc
CVC C
V
C
[ + nas ]
English Underlying
cvc
[ + nas ]
[ + nas ] Surface
V
C
V
cvc
C
[ + nas] Surface
V
V
C
C
[ -Knas ] seven oral vowels in the language has a corresponding nasal vowel, as in the minimal pairs [pak] "slime" and [pak] "cooking" (Ferguson and Chowdhury 1960). A postlexical process of regressive assimilation is an additional source of surface nasalization applying to both monomorphemic and heteromorphemic words.2 /kha + o/ /kha + n/ /kha + s/ /kan/
-> -> ->
[khao] [khan] [khas] [kan]
"you (familiar) eat" "you (honorific) eat" "you (familiar, younger) eat" "ear"
We make two assumptions regarding the specification of the [nasal] feature in the underlying representation of Bengali. First, only one value is specified, in this instance [ +nasal]; second, the vowels in monomorphemic CVN words, which always surface as CVN, are underlyingly oral and are therefore not specified for nasality. The surface nasality of these vowels is entirely 2
Our experiments on Bengali nasalization in fact only use monomorphemic words; we illustrate here the application of the nasal-assimilation rule on heteromorphemic words to show how it works in the language in general. 235
Segment
predictable, and since the rule of nasal assimilation is independently needed for heteromorphemic words (as illustrated above), our assumptions about underspecified underlying forms (especially with unmarked values [Kiparsky 1985: 92]) suggest that for all monomorphemic VN sequences, the underlying representation of the vowel segment should be an unmarked oral vowel. Note that the experiment reported below uses only monomorphemic CVN sequences of this type. Thus, assuming that the vowels in CVNs are underlyingly oral, and given the nasal-assimilation rule, this leads to the situation shown in the upper half of table 9.1. Surface nasality is ambiguous in Bengali since the vowels in both CVNs and in CVCs are realized as nasal. Unlike English, therefore, the nasal-assimilation rule in Bengali is neutralizing, and creates potential ambiguity at the lexical level. 9.3.2 Experimental predictions The pattern of surface and underlying forms laid out in table 9.1 allows us to discriminate directly the predictions of a theory of the type we are advocating - where underlying mental representations are abstract and underspecified from alternative views, where the recognition process is conducted in terms of some representation of the surface form of the word in question.3 Note that the surface-representation theory that we are assuming here will itself have to abstract away, at least to some degree, from the phonetic detail of a word's realization as a spoken form. We interpret "surface representation," therefore, as being equivalent to the representation of a word's phonological form after all phonological rules have applied - in the case of assimilatory processes like derived nasalization, for example, after the feature [nasal] has spread to the preceding consonant. Surface representation means, then, the complete specification of a word's phonetic form, but without any of the details of its realization by a given speaker in a given phonetic environment. The differences in the predictions of the surface and the underlying hypotheses only apply, however, to the period during which the listener is hearing the oral or nasal vowel (in monosyllables of the type illustrated in table 9.1). Once the listener hears the following consonant, then the interpretation of the preceding vowel becomes unambiguous. The experimental task that we will use allows us to establish how the listener is responding to the vowel before the consonant is heard. This is the gating task (Grosjean 1980; Tyler and Wessels 1983), in which listeners are presented, at successive trials, with gradually incrementing information 3
This view is so taken for granted in current research on lexical access that it would be invidious to single out any single exponent of it. 236
9 Aditi Lahiri and William Marslen- Wilson
about the word being heard. At each increment they are asked to say what word they think they are hearing, and this enables the experimenter to determine how the listener is interpreting the sensory information presented up to the point at which the current gate terminates. Previous research (Marslen-Wilson 1984, 1987) shows that performance in this task correlates well with recognition performance under more normal listening conditions. Other research (Warren and Marslen-Wilson 1987) shows that gating responses are sensitive to the presence of phonetic cues such as vowel nasalization, as they become available in the speech input.4 We will use the gating task to investigate listeners' interpretations of phonetically oral and phonetically nasal vowels for three different stimulus sets. Two sets reflect the structure laid out in table 9.1: a set of CVC, CVN, and CVC triplets in Bengali, and a set of CVC and CVN doublets in English. To allow a more direct comparison between English and Bengali, we will also include a set of Bengali doublets, consisting of CVN and CVC pairs, where the lexicon of the language does not contain a CVC beginning with the same consonant and vowel as the CVN/CVC pair. This will place the Bengali listeners, as they heard the CVN stimuli, in the same position, in principle, as the English listeners exposed to an English CVN. In each case, the item is lexically unambiguous, since there are no CVC words lexically available. As we will lay out in more detail in section 9.3.4 below, the underlyingrepresentation hypothesis predicts a quite different pattern of performance than any view of representation which includes redundant information (such as derived nasalization). In particular, it predicts that phonetically oral vowels will be ambiguous between CVNs and CVCs for both Bengali and English, whereas phonetically nasal vowels will be unambiguous for both languages but in different ways - in Bengali vowel nasalization will be interpreted as reflecting an underlying nasal vowel followed by an oral consonant, while in English it will be interpreted as reflecting an underlying oral vowel followed by a nasal consonant. Notice that these predictions for nasalized vowels also follow directly from the cohort model's claims about the immediate uptake of information in the speech signal - as interpreted in the context of this specific set of claims about the content of lexical-form representations. For the Bengali case, vowel nasalization should immediately begin to be interpreted as information about the vowel currently being heard. For the English case, where the 4
4 Ohala (this volume) raises the issue of "hysteresis" effects in gating: namely, that because the stimulus is repeated several times in small increments, listeners become locked in to particular perceptual hypotheses and are reluctant to give them up in the face of disconfirming information. Previous research by Cotton and Grosjean (1984) and Salasoo and Pisoni (1985) shows that such effects are negligible, and can be dismissed as a possible factor in the current research.
This is consistent with earlier research (Warren and Marslen-Wilson 1987) showing that listeners will select from the class of nasal consonants in making their responses, even before the place of articulation of this consonant is known.
Turning to a surface-representation hypothesis, this makes the same predictions for nasalized vowels in English, but diverges for all other cases. If the representation of CVNs in the recognition lexicon codes the fact that the vowel is nasalized, then phonetically oral vowels should be unambiguous (listeners should never interpret CVCs as potential CVNs) in both Bengali and English, while phonetically nasal vowels should now be ambiguous in Bengali - ceteris paribus, listeners should be as willing to interpret vowel nasalization as evidence for a CVN as for a CṼC.

9.3.3 Method

9.3.3.1 Materials and design

Two sets of materials were constructed, for the Bengali and for the English parts of the study. We will describe first the Bengali stimuli. The primary set of Bengali stimuli consisted of twenty-one triplets of Bengali words, each containing a CVC, a CVN, and a CṼC, where each member of the triplet shared the same initial oral consonant (or consonant cluster) and the same vowel (oral or nasal), but differed in the final consonant (which was either oral or nasal). An example set is the triplet /kap/, /kam/, /kãp/. As far as possible the place of articulation of the word-final consonant was kept constant. The vowels [a, o, ɔ, æ, e] and their nasal counterparts were used. We also attempted to match the members of each triplet for frequency of occurrence in the language. Since there are no published frequency norms for Bengali, it was necessary to rely on the subjective familiarity judgments of a native speaker (the first author). Judgments of this type correlate well with objective measures of frequency (e.g. Segui et al. 1982).
The second set of Bengali stimuli consisted of twenty doublets, containing matched CVCs and CVNs, where there was no word in the language beginning with the same consonant and the nasal counterpart of the vowel. An example is the pair /lom/, /lop/, where there is no lexical item in the language beginning with the sequence /lõ/. The absence of lexical items with the appropriate nasal vowels was checked in a standard Bengali dictionary
5 Ohala (this volume) unfortunately misses this point, which leads him to claim, quite wrongly, that there is an incompatibility between the general claims of the cohort model and the assumptions being made here about the processing of vowel nasalization in English.
(Dev 1973). As before, the place of articulation of the final consonant in each doublet was kept constant. We used the same vowels as for the triplets, with the addition of [i] and [u]. Given the absence of nasal vowels in English, only one set of stimuli was constructed. This was a set of twenty doublets, matched as closely as possible to the Bengali doublets in phonetic structure. The pairs were matched for frequency, using the Kucera and Francis (1967) norms, with a mean frequency for the CVNs of 18.2 and for the CVCs of 23.4.
All of the stimuli were prepared in the same way for use in the gating task. The Bengali and English stimuli were recorded by native speakers of the respective languages. They were then digitized at a sampling rate of 20 kHz for editing and manipulation in the Max-Planck speech laboratory. Each gating sequence was organized as follows. All gates were set at a zero crossing. We wanted to be able to look systematically at responses relative both to vowel onset and to vowel offset. The first gate was therefore set, for all stimuli, at the end of the fourth glottal pulse after vowel onset. This first gate was variable in length. The gating sequence then continued through the vowel in approximately 40 msec. increments until the offset of the vowel was encountered. A gate boundary was always set at vowel offset, with the result that the last gate before vowel offset also varied in length for different stimuli. If the interval between the end of the last preceding gate and the offset of the vowel was less than 10 msec. (i.e., not more than one glottal pulse), then this last gate was simply increased in length by the necessary amount. If the interval to vowel offset was more than 10 msec., then an extra gate of variable length was inserted. After vowel offset the gating sequence then continued in steady 40 msec. increments until the end of the word. Figure 9.1 illustrates the complete gating sequence computed for one of the English stimuli. The location of the gates for the stimuli was determined using a high-resolution visual display, assisted by auditory playback.
When gates had been assigned to all of the stimuli, seven different experimental tapes were then constructed. Three of these were for the Bengali triplets, and each consisted of three practice items followed by twenty-one test items. The tapes were organized so that each tape contained an equal number of CVCs, CVNs, and CṼCs, but only one item from each triplet, so that each subject heard a given initial CV combination only once during the experiment. A further two tapes were constructed for the Bengali doublets, again with three practice items followed by twenty test items, with members of each doublet assigned one to each tape. The final two tapes, for the English doublets, followed in structure the Bengali doublet tapes. On each tape, the successive gates were recorded at six-second intervals. A short warning tone preceded each gate, and a double tone marked the beginning of a new gating sequence.
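The gate-placement procedure just described is essentially a small algorithm, so it may help to restate it as code. The sketch below is not the authors' preparation software; it assumes that vowel onset, vowel offset, word end, and glottal-pulse times have already been measured (all names and example times are invented), and it ignores the snapping of boundaries to zero crossings.

```python
def gate_boundaries(glottal_pulses, vowel_offset, word_end, step=0.040):
    """Sketch of the gate-placement procedure described in the text.

    glottal_pulses: times (s) of glottal pulses, starting at vowel onset
    vowel_offset:   time (s) of the end of the vowel
    word_end:       time (s) of the end of the word
    Returns a list of gate end-points; each gate plays the word from its
    beginning up to that point.
    """
    # First gate: end of the fourth glottal pulse after vowel onset.
    gates = [glottal_pulses[3]]

    # Continue through the vowel in ~40 ms increments.
    while gates[-1] + step < vowel_offset:
        gates.append(gates[-1] + step)

    # A gate boundary is always set at vowel offset: if the remainder is
    # under 10 ms, stretch the last gate; otherwise insert an extra,
    # variable-length gate ending at the offset.
    if vowel_offset - gates[-1] < 0.010:
        gates[-1] = vowel_offset
    else:
        gates.append(vowel_offset)

    # After vowel offset, steady 40 ms increments to the end of the word.
    while gates[-1] + step < word_end:
        gates.append(gates[-1] + step)
    gates.append(word_end)
    return gates


if __name__ == "__main__":
    pulses = [0.102 + 0.008 * i for i in range(30)]   # invented pulse times
    print(gate_boundaries(pulses, vowel_offset=0.340, word_end=0.520))
```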
Figure 9.1 The complete gating sequence for the English word grade. Gate 0 marks the offset of the vowel
9.3.3.2 Subjects and procedure
For the English materials, twenty-eight subjects were tested, fourteen for each of the two experimental tapes. All subjects were native speakers of British English and were paid for their participation. For the Bengali materials, a total of sixty subjects were tested, thirty-six for the three triplet tapes, and twenty-four for the two doublet tapes. No subject heard more than one tape. The subjects were literate native speakers of Bengali, tested in Calcutta. They were paid, as appropriate, for their participation.
The same testing procedure was followed throughout. The subjects were tested in groups of two to four, seated in a quiet room. They heard the stimuli over closed-ear headphones (Sennheiser HD222), as a binaural monophonic signal. They made their responses by writing down their word choices (each in their own script), with accompanying confidence rating, in the response booklets on the desk in front of them. The booklets were organized to allow for one gating sequence per page, and consisted of a series of numbered blank lines, with each line terminating in the numbers 1 to 10. The number 1 was labeled "Complete Guess" and the number 10 was labeled "Very Certain" (or the equivalent in Bengali).
The subjects began the testing session by reading a set of written instructions, which explained the task, stressing the importance of (a) making a response to every gating fragment and of (b) writing down a complete word as a response every time and not just a representation of the sounds they thought they heard - even if they felt that their response was a complete guess. They were then questioned to make sure that they had understood the task. The three practice items then followed. The subjects' performance was
checked after each practice sequence, to determine whether they were performing correctly. The main test sequence then followed, lasting for 30-35 minutes.

9.3.4 Results and discussion
The subjects' gating responses were analyzed so as to provide a breakdown, for each item, of the responses at each gate. All scoreable responses were classified either as CVCs, CVNs, or CṼCs. The results are very clear, and follow exactly the pattern predicted by the underlying-representation hypothesis. Whether they are listening to phonetically nasal vowels or to oral vowels, listeners behave as if they are matching the incoming signal against a lexical representation which codes the vowels in CVCs and CVNs as oral and only the vowels in CṼCs as nasal. We will consider first the Bengali triplets.

9.3.4.1 Bengali triplets

Figure 9.2 gives the mean number of different types of response (CVC, CṼC, or CVN) to each type of stimulus, plotted across the five gates up to the offset of the vowel (gate 0 in the figure) and continuing for five gates into the final consonant. The top panel of figure 9.2 shows the responses to CVN stimuli, and the other two panels the responses to the CṼC and CVC stimuli. The crucial data here are for the first five gates. Once the listeners receive unambiguous information about the final consonant (starting from gate +1), then their responses no longer discriminate between alternative theories of representation. What is important is how listeners respond to the vowel before the final consonant. To aid in the assessment of this we also include a statistical summary of the responses over the first five gates. Table 9.2 gives the overall mean percentage of responses of different types to each of the three stimulus types.
Consider first the listeners' responses to the nasalized vowels in the CVNs and CṼCs. For both stimulus types, the listeners show a strong bias towards CṼC responses. They interpret the presence of nasalization as a cue that they are hearing a nasal vowel followed by an oral consonant (a CṼC) and not as signaling the nasality of the following consonant. This is very clear both from the response distributions in figure 9.2 and from the summary statistics in table 9.2. This lack of CVN responses to CVN stimuli (up to gate 0) cannot be attributed to any lack of nasalization of the vowel in these materials. The presence of the CṼC responses shows that the vowel was indeed perceived as nasalized, and the close parallel between the CṼC response curves for the CVN and CṼC stimuli shows that the degree of perceived nasalization was approximately equal for both types of stimulus.
Figure 9.2 Bengali triplets: mean percentage of different types of response (CVC, CṼC, or CVN) to each type of stimulus, plotted across the five gates up to offset of the vowel (gate 0) and continuing for five gates into the consonant. The top panel gives the responses to CVN stimuli, and the remaining panels the responses to the CṼC and CVC stimuli
Table 9.2 Bengali triplets: percent responses up to vowel offset

                          Type of response
Stimulus          CVC          CṼC          CVN
CVC              80.3          0.7         13.4
CṼC              33.2         56.8          5.2
CVN              23.5         63.0          7.9
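Summary figures of the kind given in table 9.2 are straightforward tallies of the classified responses over items and gates. The following sketch is only an illustration of how such a per-stimulus breakdown could be computed, not the analysis actually used; the toy records are invented, and "CVnC" is used as an ASCII stand-in for CṼC.

```python
from collections import Counter

# Invented toy data: each record is (stimulus_type, gate, response_type),
# where gate 0 is vowel offset and negative gates precede it.
responses = [
    ("CVC",  -5, "CVC"), ("CVC",  -5, "CVN"), ("CVC",  -4, "CVC"),
    ("CVnC", -5, "CVnC"), ("CVnC", -4, "CVnC"), ("CVnC", -4, "CVC"),
    ("CVN",  -5, "CVnC"), ("CVN",  -4, "CVnC"), ("CVN",  -3, "CVC"),
]

def percent_by_stimulus(records, gate_range=range(-5, 1)):
    """Percentage of each response type per stimulus type, pooling the
    responses from the gates in gate_range (default: up to vowel offset).
    (Pooling is a simplification; per-gate means could also be averaged.)"""
    table = {}
    for stim in {r[0] for r in records}:
        counts = Counter(resp for s, g, resp in records
                         if s == stim and g in gate_range)
        total = sum(counts.values())
        table[stim] = {resp: 100.0 * n / total for resp, n in counts.items()}
    return table

for stim, dist in sorted(percent_by_stimulus(responses).items()):
    print(stim, {r: round(p, 1) for r, p in dist.items()})
```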
These results are problematic for a surface-representation hypothesis. On such an account, vowel nasalization is perceptually ambiguous, and responses should be more or less evenly divided between CVNs and CṼCs. To explain the imbalance in favor of CṼC responses, this account would have to postulate an additional source of bias, operating postperceptually to push the listener towards the nasal-vowel interpretation rather than the oral-vowel/nasal-consonant reading.
This becomes implausible as soon as we look at the pattern of responses to CVC stimuli, where oral vowels are followed by oral consonants. Performance here is dominated by CVC responses. Already at gate -5 the proportion of CVC responses is higher than for the CVN or CṼC stimuli, and remains fairly steady, at around 80 percent, for the next five gates. Consistent with this, there are essentially no CṼC responses at all. In contrast, there is a relatively high frequency of CVN responses over the first five gates. Listeners produce more than twice as many CVN responses, on average, to CVC stimuli than they do to either CVN or CṼC stimuli. This is difficult to explain on a surface-representation account. If CVNs are represented in the recognition lexicon as containing a nasalized vowel followed by a nasal consonant, then there should be no more reason to produce CVNs as responses to CVCs than there is to produce CṼCs. And certainly, there should be no reason to expect CVN responses to be significantly more frequent to oral vowels than to nasalized vowels. On a surface-representation hypothesis these responses are simply mistakes, which leaves unexplained why listeners do not make the same mistake with CṼC responses.
On the underlying-representation hypothesis, the pattern of results for CVC stimuli follows directly. The recognition lexicon represents CṼCs as
having a nasal vowel. There is therefore no reason to make a CṼC response when an oral, non-nasalized vowel is being heard. Both CVCs and CVNs, however, are represented as having an oral vowel (followed in the one case by an oral consonant and in the other by a nasal consonant). As far as the listener's recognition lexicon is concerned, therefore, it is just as appropriate to give CVNs as responses to oral vowels as it is to give CVCs. The exact distribution of CVC and CVN responses to CVCs (a ratio of roughly 4 to 1) presumably reflects the distributional facts of the language, with CVCs being far more frequent than CVNs.

9.3.4.2 Bengali doublets

The second set of results involves the Bengali doublets. These were the stimulus sets composed of CVCs and CVNs, where there was no CṼC in the language that shared the same initial consonant and vowel. Figure 9.3 gives the results across gates, showing the number of responses of different types to the two sets of stimuli, with the CVN stimuli in the upper panel and the CVC stimuli in the lower panel. Table 9.3 summarizes the overall mean percentage of responses of each type for the five gates leading up to vowel closure.
Again, the results follow straightforwardly from the underlying-representation hypothesis and are difficult to explain on a surface-representation hypothesis. The CVC stimuli elicit the same response pattern as we found for the triplets. There are no nasal-vowel responses, an average of over 80 percent CVC responses, and the same percentage as before of CVN responses, reaching nearly 15 percent. The CVN stimuli elicit a quite different response pattern. Although the listeners again produced some CVN responses (averaging 16 percent over the first five gates), they also produce a surprising number of CṼC responses (averaging 17 percent). In fact, for gates -3 to 0, they produce more CṼC responses than they do CVN responses. The way they do this is, effectively, by inventing new words. Instead of producing the CVN that is lexically available, the listeners produce as responses CṼCs that are closely related, phonetically, to the consonant-vowel sequence they are hearing. They either produce real words, whose initial consonant or medial vowel deviates minimally from the actual stimulus, or they invent nonsense words.
This striking reluctance to produce a CVN response, even when the input is apparently unambiguous, cannot be explained on a surface-representation hypothesis. If the listener knows what a CVN sounds like, then why does he not produce one as a response when he hears one - and when, indeed, the lexicon of the language does not permit it to be anything else? In contrast, this difficulty in producing a CVN response follows directly from the underlying-representation hypothesis, where nasalization on the surface is interpreted as a cue to an underlyingly nasal vowel.
Figure 9.3 Bengali doublets: mean percentage of different types of response (CVC, CṼC, or CVN) plotted across gates, gate 0 marking the offset of the vowel. The upper panel gives responses to CVN stimuli and the lower panel the responses to CVC stimuli
Table 9.3 Bengali doublets: percent responses up to vowel offset

                          Type of response
Stimulus          CVC          CṼC          CVN
CVC              82.6          0.0         14.7
CVN              64.7         17.0         15.6
Figure 9.4 Mean percentage of different types of responses to the two English stimulus sets across gates, gate 0 marking the vowel offset. Responses to CVN stimuli are plotted on the upper panel and to CVC stimuli on the lower panel
For the doublets, this will mean that the listener will not be able to find any perfect lexical match, since the CVN being heard is represented in the recognition lexicon as underlyingly oral, and there is no lexically available CṼC. This predicts, as we observed here, that there should not be a large increase in CVN responses even when the CVN is lexically unambiguous. A CṼC which diverges in some other feature from the input will be just as good a match as the CVN, at least until the nasal consonant is heard.

9.3.4.3 English doublets

English has no underlying nasal-vowel segments, so that vowel nasalization appears only as an allophonic process, with the nasal feature spreading to an oral vowel from the following nasal consonant.
Table 9.4 English doublets: percent responses up to vowel offset

                     Type of response
Stimulus          CVC          CVN
CVC              83.4         16.6
CVN              59.3         40.7
This means that the interpretation of vowel nasalization in a CVN should be completely unambiguous. Figure 9.4 plots the responses across gates, showing the number of responses of different types to the two stimulus sets, with CVN stimuli in the upper panel and CVC stimuli in the lower. Table 9.4 summarizes the overall percentage of responses of each type for the five gates up to vowel offset.
The responses to the CVN stimuli straightforwardly follow from the phonological and phonetic conditions under which the stimuli are being interpreted. There is already a relatively high proportion of CVN responses at gate -5, indicating an early onset of nasalization, and the proportion of CVN responses increases steadily to vowel offset.6 The overall proportion of CVN responses for these first five gates (at 41 percent) is quite similar to the combined total of nonoral responses (33 percent) for the Bengali CVNs. This suggests that the stimulus sets in the two languages are effectively equivalent in degree of vowel nasalization.
The overall pattern of responses to the English CVC stimuli closely parallels the pattern of responses for the Bengali doublets. There is the same overall proportion of CVC responses, and the number of CVN responses, at 17 percent, is very close to the 15 percent found for the Bengali doublet CVCs. This is, again, a pattern which follows much more naturally from an underlying-representation hypothesis than from a surface-representation account. If English listeners construct their representation of CVNs on the basis of their sensory experience with the phonetic realization of CVNs, then this representation should code the fact that these words are produced with nasalized vowels. And if this was captured in the representation in the recognition lexicon, then the CVN responses to phonetically oral vowels can only be explained as mistakes.
6 Note that, contrary to Ohala (this volume), the total amount of nasalization, either at vowel onset or vowel offset, is not at issue here. What is important is how nasalization is interpreted, not whether or not it occurs.
Table 9.5 CVN responses to CVC stimuli: place effects across gates (percent response)

                                     Gates
                   -5     -4     -3     -2     -1      0     +1     +2
Correct place    12.0   14.5   15.5   11.5   13.5   10.0   21.0    1.5
Incorrect place   9.5    5.0    6.0    5.0    2.5    1.5    0.0    0.0
In contrast, on the underlying-representation story, the listener simply has no basis for discriminating CVCs from CVNs when hearing an oral vowel. The underlying representation is unspecified in terms of the feature [+nasal]. This leads to a basic asymmetry in the information value of the presence as opposed to the absence of nasality. When an English vowel is nasalized, this is an unambiguous cue to the manner of articulation of the following consonant, and as more of the vowel is heard the cue gets stronger, leading to an increased proportion of CVN responses, as we see in figure 9.4. But this is not because the cue of nasalization is interpreted in terms of the properties of the vowel. It is interpreted in terms of the properties of the following consonant, which is specified underlyingly as [+nasal]. The processing system seems well able to separate out simultaneous cues belonging to different segments, and here we see a basis for this capacity in terms of the properties of the representation onto which the speech input is being mapped.
The absence of nasality is not informative in the same way. Hearing more of an oral vowel does not significantly increase the number of CVC responses or decrease the number of CVN responses. The slight drop-off we see in figure 9.4 up to gate 0 reflects the appearance of cues to the place of articulation of the following consonant, rather than the accumulation of cues to orality. Some of the CVN responses produced to CVC stimuli at the earlier gates do not share place of articulation with the CVC being heard (for example, giving bang as a response to bad). It is these responses that drop out as vowel offset approaches, as table 9.5 illustrates. Table 9.5 gives the CVN responses to CVC stimuli, listed according to the correctness of the place of articulation of the CVN response. The lack of change in correct place responses over the five gates to vowel offset (gate 0) emphasizes the uninformativeness of the absence of nasality. This follows directly from the underlying-representation hypothesis, and from the properties of this representation as suggested by phonological theory. Only unpredictable information is specified in the underlying representation, and since
orality is the universally unmarked case for vowels, oral vowels have no specification along the oral/nasal dimension. The underlying representation of the vowel in CVCs and CVNs is therefore blind to the fact that the vowel is oral. The listener will only stop producing CVNs as responses when it becomes clear that the following consonant is also oral.
An important aspect, finally, of the results for the English doublets is that they provide evidence for the generality of the claims we are making here. Despite the contrasting phonological status of nasality in the Bengali vowel system as opposed to the English, both languages treat oral vowels in the same way, and with similar consequences for the ways in which the speakers of these languages are able to interpret the absence of nasality in a vowel. Although vowel nasalization has a very different interpretation in Bengali than in English, leading to exactly opposite perceptual consequences, the presence of an oral vowel leads to very similar ambiguities for listeners in both languages. In both Bengali and English, an oral vowel does not discriminate between CVC and CVN responses.

9.4 Processing and representation of length
In the preceding section, we investigated the processing of the feature [nasal] in two different languages, showing that the interpretation of phonetically nasal and oral vowels patterned in the ways predicted by a phonologically based theory of the recognition lexicon. In this section we consider the processing and representation of a different type of segmental contrast - length. We will be concerned specifically with consonantal length and the contrast between single and geminate consonants. Length is not represented by any feature [long] (see section 9.2.1); rather, the featural specifications of a geminate and a single consonant are exactly the same. The difference lies in the linking between the melody and the skeletal representations - dual for geminates and single for nongeminates. The language we will be studying is again Bengali, where nongeminate and geminate consonants contrast underlyingly in intervocalic position, as in /pala/ "turn" vs. /palla/ "scale, competition." The predominant acoustic cue marking geminate consonants is the duration of the consonantal closure (Lahiri and Hankamer 1988). This is approximately twice as long in geminate as opposed to nongeminate consonants (251 vs. 129 msec. in the materials studied by Lahiri and Hankamer, and 188 vs. 76 msec. in the sonorant liquids and nasals used in this study).
As we discussed earlier, spectral information like nasality is interpreted with respect to the appropriate representation as soon as it is perceived. For geminates, the involvement of the skeleton along with the melody predicts a different pattern of listener responses. Unlike nasality, interpretation of surface duration will depend not on the lexically
marked status of the feature but on the listener's assessment of the segment slots and therefore of the prosodic structure. This means that information about duration will not be informative in the same way as spectral information. For example, when the listener hears nasal-murmur information during the closure of the geminate consonant in a word like [kanna], the qualitative information that the consonant is a nasal can be immediately inferred. But the quantity of the consonant (the fact that it is geminated) will not be deduced - even after approximately 180 msec. of the nasal murmur - until after the point of release of the consonant, where this also includes some information about the following vowel. Since geminates in Bengali are intervocalic, duration can only function as a cue when it is structurally (or prosodically) plausible - in other words, when there is a following vowel. If, in contrast, length had been marked as a feature on the consonant in the same way as standard melodic features, then duration information ought to function as a cue just like nasality. Instead, if quantity is independently represented, then double-consonant responses to consonantal quantity will only emerge when there is positive evidence that the appropriate structural environment is present.

9.4.1 Method

The stimuli consisted of sixteen matched bisyllabic geminate/nongeminate pairs, differing only in their intervocalic consonant (e.g. [pala] and [palla]). The contrasting consonants consisted of eight [l]s, seven [n]s, and one [m]. The primary reason for choosing these sonorant consonants was that the period of consonantal closure was not silent (as, for example, in unvoiced stops), but contained a continuous voiced murmur, indicating to the listener the duration of the closure at any gate preceding the consonantal release at the end of the closure. The geminates in these pairs were all underlying geminates. We attempted to match the members of each pair for frequency, again relying on subjective familiarity judgments.
The stimuli were prepared for use in the gating task in the same way as in the first experiment. The gating sequence was, however, organized differently. Each word had six gates (as illustrated in figure 9.5). The first gate consisted of the CV plus 20 msec. of the closure. The second gate included an extra 40 msec. of consonantal-closure information. The third gate was set at the end of the closure, before any release information was present. This gate differed in length for geminates and nongeminates, since the closure was more than twice as long for the geminates. It is at this gate, where listeners have heard a closure whose duration far exceeds any closure that could be associated with a nongeminated consonant, that we would expect geminate responses if listeners were sensitive to duration alone. The fourth gate included the release plus two glottal pulses - enough information to indicate that there was a vowel even though the vowel quality was not yet clear. The fifth gate contained another four glottal pulses - making the identity of the vowel quite clear. The sixth and last gate included the whole word.
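Since these gates are defined by landmarks in the signal rather than by a fixed increment, the scheme can again be summarized in a few lines. The sketch below simply restates the six-gate design as described; it is not the original preparation software, and the landmark names and example times are invented.

```python
def six_gates(cv_end, closure_end, vowel_pulses, word_end):
    """Gate end-points for the geminate/nongeminate stimuli as described:
    1: CV + 20 ms of closure          4: release + 2 glottal pulses
    2: CV + 60 ms of closure          5: release + 6 glottal pulses
    3: end of closure (no release)    6: whole word

    cv_end:       time (s) at which the initial CV ends / closure begins
    closure_end:  time (s) of the end of the closure, just before release
    vowel_pulses: times (s) of successive glottal pulses in the final vowel
    """
    return [
        cv_end + 0.020,
        cv_end + 0.060,
        closure_end,
        vowel_pulses[1],   # two pulses into the following vowel
        vowel_pulses[5],   # a further four pulses
        word_end,
    ]


# Invented example times for a geminate token like [kanna].
print(six_gates(0.180, 0.370, [0.378 + 0.008 * i for i in range(10)], 0.520))
```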
Figure 9.5 The complete gating sequence for the Bengali pair kana and kanna
Two tapes were constructed, with each tape containing one member of each geminate/nongeminate pair, for a total of eight geminates and eight nongeminates. Three practice items preceded the test items. A total of twenty-eight subjects were tested, fourteen for each tape. No subject heard more than one tape. The subjects were literate native speakers of Bengali, tested in Calcutta. The testing procedure was the same as before (see section 9.3.3), except for the fact that here the subjects were instructed to respond with bisyllabic words. Bisyllabic responses allowed us to determine whether
listeners interpreted quantity information at each gate as geminate or nongeminate.
9.4.2 Results and discussion
The questions that we are asking in this research concern the nature of the representation onto which durational information is mapped. Do listeners interpret consonant length as soon as the acoustic information is available to them (like nasality or other quality cues) or do they wait till they have appropriate information about the skeletal structure? To answer this question, all scoreable responses for the geminate/nongeminate stimulus pairs were classified as CVCV (nongeminate) or CVC:V (geminate). Two pairs of words were discarded at this point. In one pair, the third gate edge had been incorrectly placed for one item, and for the second pair, the initial vowel of the nongeminate member turned out to be different for some of the subjects. The subsequent analyses are based on fourteen geminate/nongeminate pairs.
In figure 9.6, we give the mean percentage of geminate responses to the geminate and nongeminate pairs plotted across the six gates. The crucial data here are for the third and fourth gates. At the third gate, the entire closure information was present both for the CVCV (nongeminate) words as well as the CVC:V words (geminates). We know from earlier experiments (Lahiri and Hankamer 1988) that cross-splicing the closure from geminates to nongeminates and vice versa, with all other cues remaining constant, can change the listeners' percept. The duration of closure overrides any other cue in the perception of geminates. One might expect that at the third gate the listeners may use this cue and respond with geminates. But that is not the case. There is still a strong preference for nongeminate responses (83 percent nongeminate vs. 17 percent geminate) even though listeners at this gate are hearing closures averaging 170 msec. in duration, which could never be associated with a nongeminate consonant. Instead, durational information is not reliably interpreted till after the fourth gate, where they hear the release and two glottal pulses of the vowel, indicating that this is a bisyllabic word.
Note that although listeners were reluctant to give bisyllabic geminated words as responses until gate 4, the same restrictions did not apply to bisyllabic words which had intervocalic consonant clusters, with the second member being an obstruent. Four of the geminate/nongeminate pairs used here had competitors of this type - for example, the word [palki], sharing its initial syllable with [palla] and [pala]. For this subset of the stimuli, we find that in gate 3 of the geminates (with all the closure information available), CVCCV responses average 45 percent - as opposed to only 16 percent CVC:V responses. Listeners prefer to interpret the speech input up to this point as a single linked C either followed by a vowel (CVCV) or by a consonant (CVCCVC).
Figure 9.6 Mean percentage of geminate responses for the geminate and nongeminate stimuli plotted across the six gates
Dually linked Cs (i.e. geminates) are reliably given as responses only after there is information to indicate that there is a following vowel - at which point, of course, CVCCV responses drop out completely. The infrequency of geminated responses at gate 3 - and the frequency of CVCV and CVCCV (when available) responses - is compatible with the hypothesis that the speech input is interpreted correctly with respect to the melodic information, and that bisyllabic words given in response contain a separate melodic segment. The [pal] part of [palla] is matched to all sequences with [pal], including the geminate. But the information available to the listener at that point is that it is an [l], and the word candidates that are more highly activated are those that can follow a single [l], namely a vowel or another consonant - that is, another singly linked segment like [palki] or [pala], leading to the observed preference for nongeminate words as responses. Even if the closure is long and ought to trigger geminate responses, the melodic input available is incompatible with the available structural information - a geminate [l] evidently cannot be interpreted as such unless both the melodic and the structural information is made available. But as soon as it is clear that there is a following vowel, then the acoustic input can be correctly interpreted.
9.5 Conclusions
In this paper we have sketched the outline of a psycholinguistic model of the representation and processing of lexical form, combining the cohort model of lexical access and selection with a set of assumptions about the contents of the recognition lexicon that derive from current phonological theory. The joint predictions of this model, specifying how information in the speech signal should be interpreted relative to abstract, multilevel, underspecified, lexical representations, were evaluated in two experiments. The first of these investigated the processing of a melodic feature, under conditions where its phonological status varied cross-linguistically. The second investigated the interpretation of quantity information in the signal, as a function of cues to the structural organization of the word being heard. In each case, the responses patterned in a way that was predicted by the model, suggesting that the perceptually relevant representations of lexical form are indeed functionally isomorphic to the kinds of representations specified in phonological theory.
If this conclusion is correct, then there are two major sets of implications. Where psycholinguistic models are concerned, this provides, perhaps for the first time, the possibility of a coherent basis for processing models of lexical access and selection. These models cannot be made sufficiently precise without a proper specification of the perceptual targets of these processes, where these targets are the mental representations of lexical form. Our research suggests that phonological concepts can provide the basis for a solution to this problem.
The second set of implications is for phonological theory. If it is indeed the case that the contents of the mental-recognition lexicon can, in broad outline, be characterized in phonological terms, then this suggests a much closer link between phonological theory and experimental research into language processing than has normally been considered by phonologists (or indeed by psychologists). Certainly, if a phonological theory has as its goal the making of statements about mental representations of linguistic knowledge, then the evaluation of its claims about the properties of these representations will have to take into account experimental research into the properties of the recognition lexicon.
Comments on chapter 9
JOHN J. OHALA

I applaud Lahiri and Marslen-Wilson for putting underspecification and other phonological hypotheses into the empirical arena.* I share their view that phonological hypotheses that purport to explain how language is represented in the brain of the speaker have to be evaluated via psycholinguistic tests. It is a reasonable proposal that the way speakers recognize heard words should be influenced by the form of their lexical representation, about which underspecification has made some very specific claims.
Still, this paper leaves me very confused. Among its working hypotheses are two that seem to me to be absolutely contradictory. The first of these is that word recognition will be performed by making matches between the redundant input form and an underlying lexical form which contains no predictable information. The second is that the match is made between the input form and a derived form which includes surface redundancies. Previous work by Marslen-Wilson provides convincing support for the second hypothesis. He showed that as soon as a lexically distinguishing feature appeared in a word, listeners would be able to make use of it in a word-recognition task; e.g., the nasalization of the vowel in drown could be used to identify the word well before the actual nasal consonant appeared. Similar results had earlier been obtained by Ali et al. (1971). Such vowel nasalization is regarded by Lahiri and Marslen-Wilson as predictable and therefore not present in the underlying representation.
In the Bengali triplet experiment the higher number of CVN responses (than CṼC responses) to CVC stimuli is attributed to the vowel in CVN's underlying representation being unspecified for [nasal], i.e., that to the listener the oral vowel in CVC could be confused with the underlying "oral" vowel in CVN. A similar account is given for the results of the Bengali doublet experiment (the results of which are odd since they include CṼC responses even though there were supposedly no words of this sort in the language). On the other hand, the progressive increase in the CVN responses to English CVN stimuli is interpreted as due to listeners being able to take advantage of the unambiguous cue which the redundant vowel nasalization offers to the nature of the postvocalic consonant. Whatever the results, it seems, one could invoke the listener having or not having access to predictable surface features of words to explain them.
Actually, given the earlier results of Ali et al. that a majority of listeners, hearing only the first half of a CVC or CVN syllable, could nevertheless discriminate them, Lahiri and Marslen-Wilson's results in the English
*I thank Anne Cutler for bibliographic leads.
experiment (summarized in fig. 9.4) are puzzling. Why did CVN responses exceed the CVC responses only at the truncation point made at the VN junction? One would have expected listeners to be able to tell the vowel was nasalized well before this point. I suspect that part of the explanation lies in the way the stimuli were presented, i.e. the shortest stimulus followed by the next shortest, and so on through to the longest and least truncated one. What one is likely to get in such a case is a kind of "hysteresis" effect, where subjects' judgments on a given stimulus in the series are influenced by their judgment on the preceding one. This effect has been studied by Frederiksen (1967), who also reviews prior work. In the present case, since a vowel in a CVN syllable is least nasalized at the beginning of the vowel, listeners' initial judgments that the syllable is not one with a nasalized vowel are reasonable, but then the hysteresis effect would make them retain that judgment even though subsequent stimuli present more auditory evidence to the contrary. Whether this is the explanation for the relatively low CVN responses is easily checked by doing this experiment with the stimuli randomized and unblocked, although this would require other changes in the test design so that presentation of the full stimulus would not bias judgments on the truncated stimuli. This could be done in a number of ways, e.g. by presenting stimuli with different truncations to different sets of listeners.
Lahiri and Marslen-Wilson remark that the greater number of CṼC over CVN responses to the CVN stimuli in the Bengali triplet study is problematic for a surface-representation hypothesis. On such an account, vowel nasalization is perceptually ambiguous, and responses should be more or less evenly divided between CVNs and CṼCs. To explain the imbalance in favor of CṼC responses, this account would have to postulate an additional source of bias, operating postperceptually. Such a perceptual bias has been previously discussed: Ohala (1981b, 1985a, 1986, forthcoming) and Ohala and Feder (1987) have presented phonological and phonetic evidence that listeners are aware of predictable cooccurrences of phonetic events in speech and react differently to predictable events than unpredictable ones. If nasalization is a predictable feature of vowels adjacent to nasal consonants, it is discounted in that it is noticed less than that on vowels not adjacent to nasal consonants. Kawasaki (1986) has presented experimental evidence supporting this. With a nasal consonant present at the end of a word, the nasalization on the vowel is camouflaged. When Lahiri and Marslen-Wilson's subjects heard the CVN stimuli without the final N they heard nasalized vowels uncamouflaged and would thus find the nasalization more salient and distinctive (since the vowel nasalization cannot be "blamed" on any nearby nasal). They would therefore think of CṼC words first. This constitutes an extension to a very fine-grained phonetic level of the
same kind of predictability that Warren (1970) demonstrated at lexical, syntactic, and pragmatic levels. Even though listeners may show some bias in interpreting incoming signals depending on the redundancies that exist between various elements of the signal, it does not seem warranted to jump to the conclusion that all predictable elements are unrepresented at the deepest levels.
The result in the geminate experiment, that listeners favored responses with underlying geminates over derived geminates, was interpreted by Lahiri and Marslen-Wilson as due to listeners opting preferentially for a word with a lexical specification of gemination (and thus supporting the notion that derived geminates, those that arose from original transmorphemic heterorganic clusters, e.g. -rl- > -ll-, have a different underlying representation). However, there is another possible interpretation for these data that relies more on morphology than phonology. Taft (1978, 1979) found that upon hearing [deiz] subjects tended to identify the word as daze rather than days, even though the latter was a far more common word. He suggested that uninflected words are the preferred candidates for word identification when there is some ambiguity between inflected and uninflected choices. Günther (1988) later demonstrated that, rather than being simply a matter of inflected vs. uninflected, it was more that base forms are preferred over oblique forms. By either interpretation, a Bengali word such as /palla/ "scale" would be the preferred response over the morphologically complex or oblique form /pallo/ "to be able to (third-person past)."
At present, then, I think the claims made by underspecification theory and metrical phonology about the lexical representation of words are unproven. Commendably, Lahiri and Marslen-Wilson have shown us in a general way how perceptual studies can be brought to bear on such phonological issues; with refinements, I am sure that the evidential value of such experiments can be improved.
Comments on chapter 9
CATHERINE P. BROWMAN

Lahiri and Marslen-Wilson's study explores whether the processes of lexical access and selection proceed with respect to underlying abstract phonological representations or with respect to surface forms. Using a gating task, two contrasts are explored: the oral/nasal contrast in English and Bengali, and the single/geminate contrast in Bengali. Lahiri and Marslen-Wilson argue that for both contrasts, it is the underlying form (rather than the surface form) that is important in explaining the results of the gating task, where the
underlying forms are assumed to differ for the two contrasts, as follows (replacing their CV notation with X notation):

(1)  Bengali oral/nasal contrast:

     VN             ṼC             VC
     X  X           X  X           X  X
        |           |
      [+nas]      [+nas]

(2)  Bengali single/geminate contrast:

     C:             C
     X  X           X
      \ /           |
       C            C
The English contrast is like the Bengali, but is only two-way (there are no lexically distinctive nasalized vowels in English). On the surface, the vowel preceding a nasal consonant is nasalized in both Bengali and English. The underlying relation between the x-tier and the featural tiers is the same for the nasalized vowel and the singleton consonant - in both cases, the relevant featural information is associated with a single x (timing unit). This relation differs for the nasal consonant and the geminate consonant - the relevant featural information ([nas]) for the nasal consonant is associated with a single x, but the featural information for the geminate is associated with two timing units. However, this latter assumption about underlying structure misses an important generalization about similarities between the behaviour, on the gating task, of the nasals and the geminates. These similarities can be captured if the nasalization for the nasal consonant is assumed to be underlyingly associated with two timing units, as in (3), rather than with a single timing unit, as in (1). (3)
     Proposed Bengali oral/nasal contrast:

     VN             ṼC             VC
     X  X           X  X           X  X
      \ /           |
      [+nas]      [+nas]
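One way to make the structural difference concrete is to encode each form as a set of timing slots plus associations and to ask, as the signal unfolds, when a dually linked autosegment has had its full structural description met. The sketch below is an illustration of that idea rather than anything from the commentary; the encoding is invented, and "V~C" stands in for ṼC.

```python
# Each entry lists its timing slots and, for each autosegment, the slot
# indices it is associated with.  Dual linking = association with two slots.
LEXICON = {
    "VN (proposed)": {"slots": 2, "nas_links": (0, 1)},  # [+nas] on V and N
    "V~C":           {"slots": 2, "nas_links": (0,)},    # [+nas] on the vowel
    "VC":            {"slots": 2, "nas_links": ()},      # no nasal autosegment
}

def structural_description_met(entry, slots_heard):
    """A dually linked autosegment only counts as matched once evidence for
    every slot it is linked to has been heard (cf. the geminate argument).
    Trivially true when nothing is linked."""
    return all(slot < slots_heard for slot in entry["nas_links"])

for heard in (1, 2):   # after the vowel only, then after the final consonant
    print(f"slots heard: {heard}")
    for name, entry in LEXICON.items():
        print(f"  {name:14s} nasal linking satisfied: "
              f"{structural_description_met(entry, heard)}")
```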
There is some articulatory evidence that is suggestive of a similarity between syllable-final nasals and (oral) geminates. Krakow (1989) demonstrates that, in English, the velopharyngeal port opens as rapidly for syllable-final nasals as for syllable-initial nasals, but is held open much longer (throughout the preceding vowel).
Figure 9.7 Schematic relations among articulations and speech envelope, assuming a consonant that employs the tongue: (a) nasal consonant preceded by nasalized vowel (schematic tongue articulation for consonant only); (b) geminate consonant
This is analogous to the articulatory difference between Finnish singleton and geminate labial consonants, currently being investigated at Yale by Margaret Dunn. If the nasal in VN sequences is considered to be an underlying geminate, then the similar behavior of the nasals and oral geminates on the gating task can be explained in the same way: lexical geminates are not accessed until their complete structural description is met. The difference between nasal
(VN) and oral (C:) geminates then lies in when the structural description is met, rather than in their underlying representation. In both cases, it is the next acoustic event that is important. As can be seen in figure 9.7, this event occurs earlier for VN than for C:. For VN, it is clear that the nasal is a geminate when the acoustic signal changes from the nasalized vowel to the nasal consonant, whereas for C:, it is not until the following vowel that the acoustic signal changes. The preference for oral responses during the vowel of the VN in the doublets experiment would follow from the structural description for the nasal geminate not yet being met, combined with a possible tendency for "oral" vowels to be slightly nasalized (Henderson 1984, for Hindi and English). This interpretation would also predict that the vowel quality should differ in oral responses to VN stimuli, since the acoustic information associated with the nasalization should be (partially) interpreted as vowel-quality information rather than as nasalization.
10
The descriptive role of segments: evidence from assimilation
FRANCIS NOLAN
10.1 Introduction
Millennia of alphabetic writing can leave little doubt as to the utility of phoneme-sized segments in linguistic description.* Western thought is so dominated by this so successful way of representing language visually that the linguistic sciences have tended to incorporate the phoneme-sized segment (henceforth "segment") axiomatically. But, as participants in a laboratory phonology conference will be the last to need reminding, the descriptive domain of the segment is limited. All work examining the detail of speech performance has worried about the relation between the discreteness of a segmental representation, on the one hand, and, on the other, the physical speech event, which is more nearly continuous and where such discrete events as can be discerned may correspond poorly with segments. One response has been to seek principles governing a process of translation1 between a presumed symbolic, segmental representation as input to the speech-production mechanism, and the overlapping or blended activities observable in the speech event. In phonology, too, the recognition of patternings which involve domains other than the segment, and the apparent potential for a phonetic component to behave autonomously from the segment(s) over which it stretches, have been part of the motivation for sporadic attempts to free phonological description from a purely segmental cast. Harris (1944), and Firth (1948),
*The work on place assimilation reported here, with the exception of that done by Martin Barry, was funded as part of grants C00232227 and R000231056 from the Economic and Social Research Council, and carried out by Paul Kerswill, Susan Wright, and Howard Cobb, successively. I am very grateful to the above-named for their work and ideas; they may not, of course, be fully in agreement with the interpretations given in this paper.
1 The term "translation" was popularized by Fowler (e.g. Fowler et al. 1980). More traditionally, the process is called "(phonetic) implementation."
may be seen as forerunners to the current very extensive exploration of the effects of loosening the segmental constraints on phonological description under the general heading of autosegmental phonology. Indeed, so successful has this current phonological paradigm been at providing insights into certain phonological patterns that its proponents might jib at its being called a "sporadic attempt."
Assimilation is one aspect of segmental phonology which has been revealingly represented within autosegmental notation. Assimilation is taken to be where two distinct underlying segments abut, and one "adopts" characteristics of the other to become more similar, or even identical, to it, as in cases such as [griːm peɪnt] green paint, [reg kɑː] red car, [bæd̪ θɔːts] bad thoughts. A purely segmental model would have to treat this as a substitution of a complete segment. Most modern phonology, of course, would treat the process in terms of features. Conventionally this would mean an assimilation rule of the following type:2
(1)
[+coronal, +anterior, -continuant] → [α coronal, β anterior, γ distr] / ___ [α coronal, β anterior, γ distr]
Such a notation fails to show why certain subsets of features, and not other subsets, seem to operate in unison in such assimilations, thus in this example failing to capture the traditional insight that these changes involve assimilation of place of articulation. The notation also implies an active matching of feature values, and (to the extent that such phonological representations can be thought of as having relevance to production) a repeated issuing of identical articulatory "commands." A more attractive representation of assimilation can be offered within an essentially autosegmental framework. Figure 10.1, adapted from Clements (1985), shows that if features are represented as being hierarchically organized on the basis of functional groupings, and if each segmental "slot" in the time course of an utterance is associated with nodes at the different levels of the hierarchy, then an assimilation can be represented as the deletion of an association to one or more lower nodes and a reassociation to the equivalent node of the following time slot. The hierarchical organization of the features captures notions such as "place of articulation"; and the autosegmental mechanism of deletion and reassociation seems more in tune with an intuitive conception of assimilation as a kind of programmed "short-cut" in the phonetic plan to save the articulators the bother of making one part of a complex gesture.
2 This formulation is for exemplification only, and ignores the question of the degree of optionality of place assimilation in different contexts.
Figure 10.1 Multitiered, autosegmental representation of assimilation of place of articulation, showing the timing, root, laryngeal, supralaryngeal, and place tiers (after Clements 1985: 237)
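The delink-and-reassociate account in figure 10.1 can also be phrased as an operation on a small data structure: each timing slot points to a place node, and assimilation redirects the first slot's pointer to the following slot's node instead of rewriting feature values. The sketch below is only an illustration of that idea (the representation is invented for the example), not an implementation of Clements's model.

```python
# Minimal illustration: place nodes as shared dictionaries, timing slots as
# a list of references to them.
def make_segment(place):
    return {"place": place}          # stand-in for a segment with a place node

# /d k/ as in "red car": two slots, each with its own place node.
slots = [make_segment({"coronal": True, "anterior": True}),
         make_segment({"coronal": False, "anterior": False, "dorsal": True})]

def assimilate_place(slots, i):
    """Delink the place node of slot i and reassociate it to the place node
    of slot i+1 (the autosegmental 'short-cut' of figure 10.1)."""
    slots[i]["place"] = slots[i + 1]["place"]      # shared, not copied

print("before:", [s["place"] for s in slots])
assimilate_place(slots, 0)
print("after: ", [s["place"] for s in slots])
print("shared place node:", slots[0]["place"] is slots[1]["place"])
```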
cut" in the phonetic plan to save the articulators the bother of making one part of a complex gesture. But even this notation, although it breaks away from a strict linear sequence of segments, bears the hallmark of the segmental tradition. In particular, it still portrays assimilation as a discrete switch from one subset of segment values to another. How well does this fit the facts of assimilation? Unfortunately, it is not clear how much reliance we can place on the "facts" as known, since much of what is assumed is based on a framework of phonetic description which itself is dominated by the discrete segment. This paper presents findings from work carried out in Cambridge aimed at investigating assimilation experimentally. In particular, the experiments address the following questions: 1 Does articulation mirror the discrete change implied by phonetic and phonological representations of assimilation? 2 If assimilation turns out to be a gradual process, how is the articulatory continuum of forms responded to perceptually? In the light of the experimental findings, the status of the segment in phonetic description is reconsidered, and the nature of representation input to the production mechanism discussed. 10.2 The articulation of assimilation The main tool used in studying the articulatory detail of assimilation has been electropalatography (EPG). This involves a subject wearing an artificial palate in which are embedded sixty-two electrodes. Tongue contact with any electrode is registered by a computer, which stores data continuously as the subject speaks. "Frames" of data can then be displayed in the form of a plan 263
of the palate with areas of contact marked. Each frame shows the pattern of contact for a 1/100 second interval. For more details of the technique see Hardcastle (1972).
The experimental method in the early EPG work on assimilation, as reported for instance in Kerswill (1985) and Barry (1985), exploited (near) minimal pairs such as . . . maid couldn't . . . and . . . Craig couldn't . . . In these the test item (maid couldn't) contains, lexically, an alveolar at a potential place-of-articulation assimilation site, that is, before a velar or labial, and the control item contains lexically the relevant all-velar (or all-labial) sequence.
It immediately became obvious from the EPG data that the answer to question 1 above, whether articulation mirrors the discrete change implied by phonetic and phonological representations of assimilation, is "no." For tokens with lexical alveolars, a continuum of contact patterns was found, ranging from complete occlusion at the alveolar ridge to patterns which were indistinguishable from those of the relevant lexical nonalveolar sequence. Figure 10.2 shows, for the utterance . . . late calls . . . spoken by subject WJ, a range of degrees of accomplishment of the alveolar occlusion, and for comparison a token by WJ of the control utterance . . . make calls . . . In all cases just the medial consonant sequence and a short part of the abutting vowels are shown. In each numbered frame of EPG data the alveolar ridge is at the top of the schematic plan of the palate, and the bottom row of electrodes corresponds approximately to the back of the hard palate. Panel (a) shows a complete alveolar occlusion (frames 0159 to 0163). Panels (b) and (c) show tokens where tongue contact extends well forward along the sides of the palate, but closure across the alveolar ridge is lacking. Panel (d) shows the lexical all-velar sequence in make calls. Figure 10.3 presents tokens spoken by KG of . . . boat covered . . . and . . . oak cupboard . . . In (a) there is a complete alveolar closure; almost certainly the lack of contact at the left side of the palate just means that the stop is sealed by this speaker, in this vowel context, rather below the leftmost line of electrodes, perhaps against the teeth. The pattern in (b) is very similar, except that the gesture towards the alveolar ridge has not completed the closure. In (c), however, although it is a third token of boat covered, the pattern is almost identical to that for oak cupboard in (d). In particular, in both (c) and (d), at no point in the stop sequence is there contact further forward than row 4.
Both Barry (1985) and Kerswill (1985) were interested in the effect of speaking rate on connected-speech processes (CSPs) such as assimilation, and their subjects were asked to produce their tokens at a variety of rates. In general, there was a tendency to make less alveolar contact in faster tokens, but some evidence also emerged that when asked to speak fast but "carefully," speakers could override this tendency. However, the main point to be made here is that a continuum of degrees of realization exists for an alveolar in a place-assimilation context.
Figure 10.2 EPG patterns for consonant sequences (subject WJ): panels (a)-(c) ...late calls..., panel (d) ...make calls...
Figure 10.3 EPG patterns for consonant sequences (subject KG): panels (a)-(c) ...boat covered..., panel (d) ...oak cupboard...
made here is that a continuum of degrees of realization exists for an alveolar in a place-assimilation context. The problem arises of how to quantify the palatographic data. The solution has been to define four articulation types. In this paper these will be referred to as full alveolar, residual alveolar, zero alveolar, and nonalveolar.
The first three terms apply to tokens with a potentially assimilable alveolar in their lexical form, and the last term indicates tokens of a control item with an all-velar or all-labial sequence in its lexical form. A full-alveolar token shows a complete closure across one or more rows near the top of the display. A residual-alveolar token lacks median closure at the top of the display, but shows contact along the sides of the palate a minimum of one row further forward than in the control non-alveolar token for the same rate by the same speaker. A zero-alveolar token shows no contact further forward than the appropriate nonalveolar control.3
To give an indication of the distribution of these articulation types in different speaking styles, Table 10.1 summarizes data from Barry (1985) and Kerswill (1985). On the left are shown the number of occurrences in normal conversational reading and in fast reading of the three alveolar articulation types. The data is here averaged over three EPG subjects who exhibited rather different "centers of gravity" on the assimilatory continuum, but all of whom varied with style. Each speaker produced two tokens of each of eight sentences containing place-assimilation sites, plus controls. A trend towards more assimilation in the fast style is evident. On the right are shown the number of occurrences of the three articulation-types for Kerswill's speaker ML, who read in four styles: slowly and carefully, normally, fast, and then fast but carefully. The speaker produced in each style twenty assimilable forms. A clear pattern emerges of increasing assimilation as the style moves away from slow careful, which was confirmed statistically by Kerswill using Kendall's tau nonparametric correlation test (tau = 0.456, p < 0.001).
There are, of course, limits on what can be inferred from EPG data. Because it records only tongue contact, it gives no information on possible differences in tongue position or contour short of contact. This means, in particular, that it is premature at this stage to conclude that even zero-alveolar tokens represent a complete assimilation of the tongue gesture of the first consonant to that of the second. X-ray filming, perhaps supplemented by electromyography of tongue muscles, might shed light on whether, even in EPG-determined zero alveolars, there remains a trace of the lexical alveolar.
3 I have in the past (e.g. Nolan 1986) used the term "partial assimilation" to refer to cases where residual-alveolar contact is evident. I now realize that phonologists use "partial assimilation" to indicate e.g. [gk] as the output of the assimilation of [dk], in contrast to the "completely" assimilated form [kk]. I shall therefore avoid using the term "partial assimilation" in this paper.
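The assignment of tokens to these articulation types is essentially mechanical once a control token is available. The sketch below is purely illustrative: the 8-by-8 binary contact grid, the treatment of the top two rows as "near the top of the display", and the toy frames are assumptions made for the example, not details of the EPG systems or procedures used in the studies cited.

```python
# Illustrative sketch only: classifying a lexical-alveolar EPG frame against a
# matched nonalveolar control frame, following the three categories defined above.
import numpy as np

def frontmost_contact_row(frame):
    """Row index (0 = alveolar ridge) of the most forward row showing any contact."""
    rows = np.flatnonzero(frame.any(axis=1))
    return int(rows[0]) if rows.size else frame.shape[0]

def classify_alveolar(test_frame, control_frame, top_rows=2):
    """Assign one of the three lexical-alveolar articulation types."""
    # Full alveolar: complete closure across one or more rows near the top of the display.
    if any(test_frame[r].all() for r in range(top_rows)):
        return "full alveolar"
    # Residual alveolar: contact at least one row further forward than in the control token.
    if frontmost_contact_row(test_frame) < frontmost_contact_row(control_frame):
        return "residual alveolar"
    # Zero alveolar: no contact further forward than the control.
    return "zero alveolar"

# Toy example: 8x8 binary contact grids (1 = electrode contacted).
control = np.zeros((8, 8), dtype=int)
control[4:, :] = 1                       # contact only in the back half (velar closure)
residual = control.copy()
residual[2, 0] = residual[2, 7] = 1      # extra side contact two rows further forward
print(classify_alveolar(residual, control))   # -> "residual alveolar"
```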
Table 10.1 EPG analysis of place assimilation: number of occurrences of articulation types in different speaking styles. On the left, the average of three speakers (Barry 1985: table 2) speaking at normal conversational speed and fast; and on the right speaker ML (Kerswill 1985: table 2) speaking slowly and carefully, normally, fast but carefully, and fast

                     Barry (3 speakers)      Kerswill (speaker ML)
                     Normal   Fast           Slow and   Normal   Fast and   Fast
                                             careful             careful
Full alveolar          23      15               10         2        3         0
Residual alveolar      14      15                5         8        3         2
Zero alveolar          11      18                5        10       14        18
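The ordinal trend in the right-hand half of the table can be recomputed along the following lines. This is only a sketch: the numeric coding of styles and articulation types is an assumption made for the illustration, so the resulting tau need not reproduce Kerswill's published value exactly.

```python
# Sketch of a nonparametric trend test over the ML counts in table 10.1.
# Styles are ranked 1-4 away from "slow and careful"; articulation types are
# scored 0 = full, 1 = residual, 2 = zero alveolar (increasing assimilation).
from scipy.stats import kendalltau

counts = {
    1: (10, 5, 5),    # slow and careful: (full, residual, zero) out of 20 tokens
    2: (2, 8, 10),    # normal
    3: (3, 3, 14),    # fast and careful
    4: (0, 2, 18),    # fast
}

styles, scores = [], []
for style, per_type in counts.items():
    for score, n in zip((0, 1, 2), per_type):
        styles.extend([style] * n)
        scores.extend([score] * n)

tau, p = kendalltau(styles, scores)
print(f"Kendall's tau = {tau:.3f}, p = {p:.4g}")
```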
These techniques are outside the scope of the present projects; but the possibility of alveolar traces remaining in zero alveolars is approached via an alternative, perceptual, route in section 10.3. Despite these limitations, these experiments make clear that the answer to question 1 is that assimilation of place of articulation is, at least in part, a gradual process rather than a discrete change.
10.3 The perception of assimilation
Since assimilation of place of articulation has been shown, in articulatory terms, to be gradual, question 2 above becomes relevant: how is the articulatory continuum of forms responded to perceptually? For instance, when does the road collapsed begin consistently to be perceived, if ever, as the rogue collapsed? Is a residual alveolar sufficient to cue perception of a lexical alveolar? Can alveolars be identified even in the absence of any signs of them on the EPG pattern?
An ideal way to address this issue would be to take a fully perfected, comprehensive, and tractable articulatory synthesizer; create acoustic stimuli corresponding to the kinds of articulation noted in the EPG data; and elicit responses from listeners. Unfortunately, as far as is known, articulatory synthesizers are not yet at a stage of development where they could be relied on to produce such stimuli with adequate realism; nor, furthermore, is the nature of place assimilation well enough understood yet at the articulatory level for it to be possible to provide the data to control the synthesizer's parameters with confidence that human activity was being simulated accurately. An alternative strategy was therefore needed.
The strategy was to use stimuli produced by a human speaker. All variables other than the consonant sequence of interest needed to be controlled, and, since there is a strong tendency for the range of forms of
interest to be associated with rate changes, it was decided to use a phonetician rather than a naive speaker to produce the data. The phonetician was recorded reading a set of "minimal-pair" sentences. The discussion below focuses on the sentences containing a lexical /d/ at the assimilation site, and these are given in table 10.2 with their paired nonalveolar controls. Each sentence was read many times with perceptually identical rate, rhythm, and prosody, but with a variety of degrees of assimilation (in the case of the lexical alveolar sentences). In no case was the first consonant of a sequence released orally. The utterances were recorded not only acoustically, but also electropalatographically. The EPG traces were used to select, for each lexical alveolar sentence, a token archetypical of each of the categories full alveolar, residual alveolar, and zero alveolar as observed previously in EPG recordings. A nonalveolar control token was also chosen. The tokens selected were analyzed spectrographically to confirm, in terms of a good duration match, the similarity of rate of the tokens for a given sentence. A test tape was constructed to contain, in pseudo-randomized order (adjacent tokens of the same sentence, and adjacent occurrences of the same articulation type, were avoided), four repetitions of the nonalveolar token, and two repetitions of each of the full-alveolar, residual-alveolar, and zero-alveolar version of the lexical alveolar sentence.

Table 10.2 Sentence pairs used in the identification and analysis tests

d+k   The road collapsed.
g+k   The rogue collapsed.
d+k   Did you go to the Byrd concert on Monday?
g+k   Did you go to the Berg concert on Monday?
d+k   Was the lead covered, did you notice?
g+k   Was the leg covered, did you notice?
d+k   Did that new fad catch on?
g+k   Did that new fag catch on?*
d+g   They did gardens for rich people.
g+g   They dig gardens for rich people.
d+g   It's improper to bed girls for money.
g+g   It's improper to beg girls for money.
d+m   A generous bride must be popular.
b+m   A generous bribe must be popular.

Note: *Fag is colloquial British for cigarette.
Two groups of subjects were used in the identification test: fourteen professional phoneticians from the universities of Cambridge, Leeds, Reading, and London; and thirty students (some with a limited amount of phonetic training). All subjects were naive as to the exact purpose of the experiment. In the identification task, subjects heard each token once and had to choose on a response sheet, on which was printed the sentence context for each token, which of the two possible words shown at the critical point they had heard. The phoneticians subsequently performed a more arduous task, an analysis test, which involved listening to each sentence as often as necessary, making a narrow transcription of the relevant word, performing a "considered" lexical identification, and attempting to assign its final consonant to one of the four EPG articulation types, these being explained in the instructions for the analysis test. The results for the group of phoneticians are discussed in detail in Kerswill and Wright (1989). Results for the identification test are summarized in figure 10.4. Performance of phoneticians (filled squares) and students (open squares) is very similar, which suggests that even if the phonetician's skills enable them to hear more acutely, the conditions of this experiment did not allow scope for the application of those skills. As might be expected, articulation type 1 (full alveolar) allows the alveolar to be identified with almost complete reliability. With type 2 (residual alveolar), less than half the tokens are correctly identified as words with lexical alveolars. With type 3 (zero alveolar), responses are not appreciably different from those for nonalveolars (type 4). The main reason that not all nonalveolar tokens are judged correctly (which would result in a 0 percent value in fig. 10.4) is presumably that subjects are aware, from both natural-language experience and the ambiguous nature of the stimuli in the experiment, that assimilation can take place, and are therefore willing to "undo" its effect and report alveolars even where there is no evidence for them. So far, the results seem to represent a fairly simple picture. Tokens of lexical alveolars which are indistinguishable in EPG traces from control nonalveolar sequences also appear to sound like the relevant nonalveolar tokens. Tokens of lexical alveolars with residual-alveolar gestures are at least ambiguous enough not to sound clearly like nonalveolars, and are sometimes judged "correctly" to be alveolar words. But there are hints in the responses as a whole that the complete picture may be more complicated. Table 10.3 shows the responses to types 3 (zero alveolar) and 4 (nonalveolar) broken down word pair by word pair. The responses are shown for the student group and for the phoneticians identifying under the same "one-pass" condition, and then for the phoneticians making their "considered" identifications when they were allowed to listen 270
Figure 10.4 Identification test: percentage of /d/ responses (by phoneticians and students separately) broken down by articulation type. The articulation type of a token was determined by EPG. The articulation types are (1) full alveolar, (2) residual alveolar, (3) zero alveolar, (4) nonalveolar
repeatedly. The "difference" columns show the difference (in percentage) between "alveolar" responses to type 3 and type 4 stimuli, and a large positive value is thus an indication of more accurate identification of the members of a particular pair.
It is striking that in the phoneticians' "considered" identifications, two pairs stand out as being more successfully identified, namely leg-lead (/led/) and beg-bed, with difference values of 40 and 36 respectively. Is there anything about these pairs of stimuli which might explain their greater identifiability? Examination of their EPG patterns suggests a possible answer. The patterns for these pairs are shown in figure 10.5. The normal expectation from the earlier EPG studies would be that the lexical alveolar member of a pair can have contact further forward on the palate, particularly at the sides, than its nonalveolar counterpart. Of course, this possibility is excluded here by design, otherwise the lexical alveolar token would not be "zero alveolar" but "residual alveolar." But the interesting feature is that the nonalveolar words (leg, beg) show evidence of contact further forward on the palate than their zero-alveolar counterparts. This could, perhaps, reflect random variation, exposed by the criteria used to ensure that zero-alveolar tokens genuinely were such, and not residual alveolars. But if that were the
Table 10.3 Percentage of "lexical alveolar" identifications for zero-alveolar (z-a) and nonalveolar (n-a) tokens by students and phoneticians in one-pass listening, and by phoneticians in making "considered" judgments, broken down by minimal pairs. The difference column (diff.) shows the difference in percentage between the two articulation types. The percentage values for students derive from 60 observations (2 responses to each token by each of 30 students), and for phoneticians in each condition from 28 observations (2 responses to each token by each of 14 phoneticians). The pairs are ordered by the phoneticians' "considered" identifications in descending order

                Students                 Phoneticians
                one-pass                 one-pass                considered
                identification           identification          identification
                z-a   n-a   diff.        z-a   n-a   diff.       z-a   n-a   diff.
lead/leg         18     8    10           18    18     0          54    14    40
bed/beg          10     5     5           21     7    14          57    21    36
fad/fag          28    19     9            7    11    -4          29    14    15
did/dig          18    16     2           29    25     4          57    45    12
bride/bribe      25    23     2           32    27     5          54    43    11
road/rogue       18    12     6           39    25    14          57    50     7
Byrd/Berg        15    10     5           14    27   -13          43    48    -5
explanation for the patterns, it is unlikely that any enhanced identification would have emerged for these pairs. A more plausible and interesting hypothesis is that the tongue configuration in realizing lexical /dg/ sequences, regardless of the extent to which the alveolar closure is achieved, is subtly different from that for /gg/ sequences. Specifically, it may be that English alveolars involve, in conjunction with the raising of the tongue tip, a certain hollowing of the tongue front which is incompatible with as fronted a velar closure as might be usual in a velar following a front vowel. The suggestion, then, is that the underlying alveolar specification is still leaving a trace in the overall articulatory gesture, even though the target normally thought of as primary for an alveolar is not being achieved. Auditorily, it seems that the vowel allophone before the lexical velar is slightly closer than before the lexical alveolar. Spectrographic analysis of the tokens, however, has not immediately revealed an obvious acoustic difference, but more quantitative methods may prove more fruitful.
The strong hypothesis which is suggested by these preliminary findings is that differences in lexical phonological form will always result in distinct
Figure 10.5 EPG patterns for the tokens of leg/lead and beg/bed used in the identification and rating experiments: nonalveolar (leg) and zero alveolar (lead); nonalveolar (beg) and zero alveolar (bed)
articulatory gestures. This is essentially the same as the constraint on phonological representations suggested, in the context of a discussion of neutralization, by Dinnsen (1985: 276), to the effect that "every genuine phonological distinction has some phonetic reflex, though not necessarily in the segments which are the seat of the distinction." Testing this hypothesis in the domain of assimilation will require a substantial amount of data from naive subjects, and a combination of techniques, including perhaps radiography and electromyography. Careful thought will also have to be given to whether an independently motivated distinction can be justified between two classes of assimilation: the type under discussion here, which is in essence optional in its operation; and a type which represents an obligatory morphophonological process. Lahiri and Hankamer (1988), for instance, discuss the process in Bengali whereby underlying /r + t/ (where + is a morpheme boundary) undergoes total assimilation to create geminate [tt]. This sequence is apparently phonetically identical to the realization of underlying /t + t/, suggesting that there is no barrier to true neutralization in the course of the morphophonologization of an assimilation process.
First, however, as a further step in investigating optional assimilations short of more sophisticated instrumental techniques, an extension of the present experiment was tried which could indirectly have a bearing on the issue. It was reasoned that any evidence that listeners perceived more alveolars for zero-alveolar tokens than for nonalveolar tokens would indicate the presence of utilizable traces of the lexical alveolar. So far, there has only been very slight evidence of this, in the "considered" judgments of the phoneticians. But the identification task in the preceding experiment actually poses quite a stiff task. Suppose the "trace" in effect consists of a fine difference in vowel height. Whilst the direction of such an effect might be expected to be consistent across speakers, such is the variation in vowel height across different individuals that it might be very difficult to judge a token in isolation.
It was therefore proposed to carry out the identification test again (on new subjects), but structured as a comparison task. A test tape was constructed using the stimuli as in the first identification experiment, but this time arranged so that in each trial a "lexical alveolar" sentence token of whatever articulation type was always heard paired one second before or after the relevant "nonalveolar" control. In this way any consistent discriminable trace would have the best chance of being picked up and anchored relative to the speaker's phonetic system. The subjects' task was then to listen for a target word (e.g. leg), and note on a response sheet whether it occurred in the first or the second sentence heard in the trial. The design was balanced, so that half the time the target was, e.g., leg and half the time lead; and half the time the "correct" answer was the first sentence, and half the time the second.
Figure 10.6 "Comparison" identification test: percentage correct identification of target alveolar/nonalveolar words from minimal-pair sentences. Each nonalveolar sentence token was paired with tokens classified as (1) full alveolar, (2) residual alveolar, (3) zero alveolar
In all, ten sentence pairs were used (the seven in table 10.2 plus three with final nasals), and with each of three "alveolar" articulation-types paired with the nonalveolar control and presented in four conditions (to achieve balance as described above), the test tape contained 120 trials. The results, for twenty naive listeners, are summarized in figure 10.6. The percentage of correct identifications as either an alveolar or a nonalveolar word is shown for stimuli consisting of, respectively, a full-alveolar token (1), a residual-alveolar token (2), and a zero-alveolar token (3) each paired with the relevant nonalveolar control. It can be seen that even in the case of pairs containing zero alveolars, correct identification, at 66 percent, is well above chance (50 percent) - significantly so (p < 0.0001) according to preliminary statistical analysis using a chi-squared test. Inspection of individual lexical items reveals, as found in the phoneticians' "considered" identifications, that some pairs are harder to identify than others.
The finding of better-than-chance performance in zero-alveolar pairs is hard to explain unless the lexical distinction of alveolar versus nonalveolar does leave traces in the articulatory gesture, even when there is no EPG evidence of an alveolar gesture. More generally, the finding is in accord with the hypothesis that differences in lexical phonological form will always result in distinct articulatory gestures. Notice, furthermore, that listeners can not only discriminate the zero-alveolar/nonalveolar pairs, but are able to relate
the nature of difference to the intended lexical form. This does not, of course, conflict with the apparent inability of listeners to identify zero alveolars "correctly" in isolated one-pass listening; the active use of such fine phonetic detail may only be possible at all when "anchoring" within the speaker's system is available.
10.4 Summary and discussion
It is clear that the work reported only scratches the phonetic surface of the phenomenon of assimilation of place of articulation, and, if it is admitted under the heading of "experimental phonology," it is experimental not only in the usual sense, but also in the sense that it tries out new methods whose legitimacy may turn out to be open to question in the light of future work on a grander scale. Nevertheless, the work has served to define what seem to be fruitful questions, and provide provisional answers. To recap: the specific questions, and their provisional answers, were:
1 Does articulation mirror the discrete change implied by phonetic and phonological representations of assimilation?
No, (optional) place assimilation is, in articulatory terms, very clearly a gradual process. Viewed electropalatographically, intermediate forms ("residual alveolars") exist in which the tongue appears to make the supporting lateral gesture towards the alveolar ridge in varying degrees, but no median closure is achieved at the alveolar ridge. Other tokens are indistinguishable from the realizations of nonalveolar sequences, suggesting complete assimilation; but with the proviso that the EPG data may not capture residual configurational differences falling short of contact. In a few tokens the EPG traces themselves give evidence of such configurational differences. Because of the limitations of the technique, it has not been possible to show whether phonologically distinct lexical forms are ever realized totally identically by virtue of optional place-assimilation.
2 If assimilation turns out to be a gradual process, how is the articulatory continuum of forms responded to perceptually?
As might be expected, identification of lexical alveolars where there is complete alveolar closure is highly reliable. Residual alveolars appear to be ambiguous; they are perceived "correctly" rather under half the time. As far as zero-alveolar forms are concerned, there is no evidence that alveolar cues, if any, can be utilized in normal "one-pass" listening, but in certain vowel environments phoneticians may have some success under repeated listening. On the other hand naive listeners were able to achieve better-than-chance identification of zero alveolars when presented directly with the zero-
alveolar/nonalveolar contrast, showing both that residual cues to the place distinction survive even here, and that listeners are at some level aware of the nature of these cues. How, then, should place assimilation be modeled? It is certain that the phonetic facts are more complicated than the discrete switch in node association implied by figure 10.1. This would only, presumably, account for any cases where the realization becomes identical with the equivalent underlying nonalveolar. One straightforward solution presents itself within the notational framework of autosegmental phonology. The node on the supralaryngeal tier for the first consonant could be associated to the place node of the second consonant without losing its association to its "own" place features. This could be interpreted as meaning, for the first consonant, "achieve the place target for the second segment, without entirely losing the original features of the first," thus giving rise to residual alveolars (stage [b] below). A further process would complete the place assimilation (stage [c]): (2)
[Diagram (2): three stages (a)-(c), each showing a timing tier (C C), a supralaryngeal tier, and a place tier. In (a) each consonant's supralaryngeal node is linked only to its own place node; in (b) the first consonant's supralaryngeal node is additionally linked to the second consonant's place node; in (c) the link to the first consonant's own place node is lost, completing the place assimilation.]
Unfortunately, there is no reason other than goodwill why that notation should be so interpreted. Since simultaneous double articulations (alveolar plus velar, alveolar plus labial) are perfectly feasible, and indeed are the norm for at least part of the duration of stop clusters in English when assimilation has not taken place, there is no reason for either of the place specifications to be downgraded in priority. Association lines do not come in different strengths - they are association lines pure and simple; and there is no independently justified principle that a node associated to only one node higher in the hierarchy be given less priority in implementation than one associated to two higher nodes. It seems, then, that a phonological notation of this kind is still too bound to the notions of discreteness and segmentality to be appropriate for modeling the detail of assimilation. Given the gradual nature of assimilation, it is therefore natural to turn to an account at the implementational level of speech production; that is, the level at which effects arise because of mechanical and physiological constraints within the vocal mechanism, and inherent characteristics of its "programming." Prime candidate for modeling assimilation here might be the notion of 277
"co-production" (e.g. Fowler 1985: 254ff.). This conceptualization of coarticulation sees a segment as having constituent gestures associated with it which extend over a given timespan. The constituent gestures do not necessarily all start and stop at the same time, and they will certainly overlap with gestures for abutting segments. Typically, vowels are spoken of as being coproduced with consonants, but presumably in a similar way adjacent consonants may be coproduced. The dentality of the (normally alveolar) lateral, and the velarization of the dental fricative, in a word such as filth, might be seen as the result of coproduction, presumably as a result of those characteristics of tongue-tip, and tongue-body, gestures having respectively a more extensive domain. Problems with such an account, based on characteristics of the vocal mechanism and its functioning, soon emerge when it is applied to what is known about place assimilation. For one thing, the same question arises as with the phonological solution rejected above: namely, that true coproduction would lead not to assimilation, but to the simultaneous achievement of both place targets - double stops again. To predict the occurrence of residual-alveolar and zero-alveolar forms it might be possible to enhance the coproduction model with distinctions between syllable-final and syllableinitial consonants, and a convention that targets for the former are given lower priority; but this seems to be creeping away from the spirit of an account based solely in the vocal mechanism. Even then, unless it were found that the degree of loss of the first target were rigidly correlated with rate of speech, it is not clear how the continuum of articulation types observed could be explained mechanically. More serious problems arise from the fact that place-assimilation behavior is far from universal. This observation runs counter to what would be expected if the behavior were the result of the vocal mechanism. Evidence of variation in how adjacent stops of different places of articulation are treated is far from extensive, but then it has probably not been widely sought up to now. However, there are already indications of such variation. Kerswill (1987: 42, 44) notes, on the basis of an auditory study, an absence or near absence in Durham English of place assimilation where it would be expected in many varieties of English. And in Russian, it has been traditionally noted that dental/alveolar to velar place assimilation is much less extensive than in many other languages, an observation which has been provisionally confirmed by Barry (1988) using electropalatography. If it becomes firmly established that place assimilation is variable across languages, it will mean that it is a phenomenon over which speakers have control. This will provide further evidence that a greater amount of phonetic detail is specified in the speaker's phonetic representation or phonetic plan than is often assumed. Compare, for instance, similar arguments recently used in 278
connection with stop epenthesis (Fourakis and Port 1986) and the microtiming of voicing in obstruents (Docherty 1989). It may, or may not, have been striking that so far this discussion has tacitly accepted a view which has been commonplace among phonologists particularly since the "generative" era: namely, that the "performance" domain relevant to phonology is production. This, of course, has not always been the case. Jakobson, Fant, and Halle (1952: 12) argued very cogently that the perceptual domain is the one most relevant to phonology: The closer we are in our investigation to the destination of the message (i.e. its perception by the receiver), the more accurately can we gage the information conveyed by its sound shape ... Each of the consecutive stages, from articulation to perception, may be predicted from the preceding stage. Since with each subsequent stage the selectivity increases, this predictability is irreversible and some variables of any antecedent stage are irrelevant for the subsequent stage. Does this then provide the key to the phonological treatment of assimilation? Suppose for a moment that residual-alveolar forms were, like zero-alveolar forms, indistinguishable from lexical nonalveolars in one-pass listening. Would it then, on the "perceptual-primacy" view of the relation of phonology to performance, be legitimate to revert to the treatment of assimilation shown in figure 10.1 - saying in effect that as far as the domain most relevant to the sound patterning of language is concerned, assimilation happens quite discretely? This is an intriguing possibility for that hypothetical state of affairs, but out of step with the actual findings of the identification experiment, in which residual-alveolar forms allowed a substantial degree of correct lexical identification. For a given stimulus, of course, the structure of the experiment forces a discrete response (one word or the other); but the overall picture is one in which up to a certain point perception is able to make use of partial cues to alveolarity. Whatever the correct answer may be in this case, the facts of assimilation highlight a problem which will become more and more acute as phonologists penetrate further towards the level of phonetic detail: that is, the problem of what precisely a phonology is modeling. One type of difficulty emerged from the fact that the level of phonetic detail constitutes the interface between linguistic structure and speech performance, and therefore between a discrete symbol system and an essentially continuous event. It is often difficult to tell, from speechperformance data itself (such as that presented here on assimilation), which effects are appropriately modeled symbolically and which treated as continuous, more-or-less effects. A further difficulty emerged from the fact that phonetic detail does not 279
present itself for analysis directly and unambiguously. Phonetic detail can only be gleaned by examining speech performance, and speech performance has different facets: production, the acoustic signal, and perception, at least. Perhaps surprisingly, these facets are not always isomorphic. For instance, experimental investigation of phonetic detail appears to throw up cases where produced distinctions are not perceived (see work on sound changes in progress, such as the experiment of Costa and Mattingly [1981] on New England [USA] English, which revealed a surviving measurable vowel-duration difference in otherwise minimal pairs such as cod-card, which their listeners were unable to exploit perceptually). In such cases, what is the reality of the linguistic structure the phonologist is trying to model?
10.5 Conclusions
This paper has aimed to show that place assimilation is a fruitful topic of study at the interface of phonology and experimental phonetics. It has been found that place assimilation happens gradually, rather than discretely, in production; and that residual cues to alveolars can be exploited with some degree of success in perception. It is argued that the facts of place assimilation can be neither modeled adequately at a symbolic, phonological level, nor left to be accounted for by the mechanics of the speech mechanism. Instead, they must be treated as one of those areas of subcontrastive phonetic detail over which speakers have control. The representation of such phenomena is likely to require a more radical break from traditional segmental notions than witnessed in recent phonological developments. Clearly, much remains to be done, both in terms of better establishing the facts of assimilation of place of articulation, and, of course, of other aspects of production, and in terms of modeling them. It is hoped that this paper will provoke others to consider applying their own techniques and talents to this enterprise.
Comments on chapter 10
BRUCE HAYES
The research reported by Nolan is of potential importance for both phonetic and phonological theory. The central claim is as follows. The rule in (3), which is often taught to beginning phonology students, derives incorrect outputs. In fluent speech, the /t/ of late calls usually does not become a /k/, but rather becomes a doubly articulated stop, with both a velar and an
10 Comments alveolar closure: [lei{t}kDilz]. The alveolar closure varies greatly in its strength, from full to undetectable. (3)
alveolar stop → [α place] / ___ [C, α place]
Nolan takes care to point out that this phenomenon is linguistic and not physiological - other dialects of English, and other languages, do not show the same behavior. This means that a full account of English phonology and phonetics must provide an explicit description of what is going on. Nolan also argues that current views of phonological structure are inadequate to account for the data, suggesting that "The representation of such phenomena is likely to require a more radical break from traditional segmental notions than witnessed in recent phonological developments." As a phonologist, I would like to begin to take up this challenge: to suggest formal mechanisms by which Nolan's observations can be described with explicit phonological and phonetic derivations. In fact, I think that ideas already in the literature can bring us a fair distance towards an explicit account. In particular, I want to show first that an improved phonological analysis can bring us closer to the phonetic facts; and second, by adopting an explicit phonetic representation, we can arrive at least at a tentative account of Nolan's data. Consider first Nolan's reasons for rejecting phonological accounts of the facts. In his paper, he assumes the model of segment structure due to Clements (1985), in which features are grouped within the segment in a hierarchical structure. For Clements, the place features are grouped together under a single PLACE node, as in (4a). Regressive place assimilation would be expressed by spreading the PLACE node leftward, as in (4b). (4)a.
[a PLACE node dominating the features [ant], [cor], [distr], etc.]
b. [two C slots, each with a SUPRA node dominating its own PLACE node - [+cor] for the first, [−cor] for the second; regressive place assimilation spreads the PLACE node of the second consonant leftward onto the SUPRA node of the first]
A difficulty with this account, as Nolan points out, is that it fails to indicate that the articulation of the coronal segment is weak and variable, whereas that of the following noncoronal is robust. However, it should be remembered that (4b) is meant as a phonological representation. There are good reasons why such representations should not contain quantitative information. The proper level at which to describe variability of closure is actually the phonetic level. I think that there is something more fundamentally wrong with the rule in (4b): it derives outputs that are qualitatively incorrect. If we follow standard phonological assumptions, (4b) would not derive a doubly articulated segment, but rather a contour segment. The rule is completely analogous to the tonal rule in (5), which derives a contour falling tone from a High tone by spreading.
(5)
[Diagram: a Low tone spreads leftward onto a vowel bearing a High tone, yielding a falling (High-Low) contour tone plus a low tone, i.e. = falling tone + low tone.]
Following this analogy, the output of rule (4b) would be a contour segment, which would shift rapidly from one place of articulation to another. If we are going to develop an adequate formal account of Nolan's findings, we will need phonological and phonetic representations that can depict articulation in greater detail. In fact, just such representations have been proposed in work by Sagey (1986a), Ladefoged and Maddieson (1989), and others. The crucial idea is shown in (6): rather than simply dominating a set of features, the PLACE node dominates intermediate nodes, corresponding to the three main oral articulators: LABIAL for the lips, CORONAL for the tongue blade, and DORSAL for the tongue body. (6)
[Diagram: PLACE dominating three articulator nodes - LABIAL (which dominates [round]), CORONAL (which dominates [ant] and [distr]), and DORSAL (which dominates [back], [high], and [low]).]
The articulator nodes are not mutually exclusive; when more than one is present, we get a complex segment. One example is the labiovelar stop, depicted as the copresence of a LABIAL and a DORSAL node under the same PLACE node. (7)
[g͡b]: a single PLACE node dominating both a LABIAL and a DORSAL node
Note that the LABIAL and DORSAL nodes are intended to be simultaneous, not sequenced. Representations like these are obviously relevant to Nolan's findings, because he shows that at the surface, English has complex segments: for example, the boldface segment in late calls [lei{t}kɔ:lz] is a coronovelar, and the boldface segment in good batch [gʊ{d}bætʃ] is a labiocoronal. A rule to derive the complex segments of English is stated in (8). The rule says that if a syllable-final coronal stop is followed by an obstruent, then the articulator node of the following obstruent is spread leftward, sharing the PLACE node with the CORONAL node.
(8)
Place Assimilation
Spread the articulator node of a following obstruent leftward, onto a syllable-final [−continuant] CORONAL segment.4
In (9) is an illustration of how the rule works. If syllable-final /t/ is followed by /k/, as in late calls, then the DORSAL articulator node of the /k/ is spread leftward. In the output, it simultaneously occupies a PLACE node with the original CORONAL node of the /t/. The output of the rule is therefore a corono-dorsal complex segment. (9)
late calls: /tk/ → [{t}k]
[Diagram: two C slots, each dominating a PLACE node; the first PLACE node dominates COR and the second dominates DORS; the DORS node is additionally linked to the first consonant's PLACE node, yielding the corono-dorsal complex segment.]
To complete this analysis, we have to provide a way of varying the degree of closure made by the tongue blade. Since phonological representations are discrete rather than quantitative, they are inappropriate for carrying out this task. I assume then, following work by Pierrehumbert (1980), Keating (1985, 1988a), and others, that the grammar of English contains a phonetic component, which translates the autosegments of the phonology into quantitative physical targets. The rule responsible for weakening alveolar closures is a phonetic rule, and as such it manipulates quantitative values.
4 There is an additional issue involved in the expression of (8): the rule must generalize over the class of articulator nodes without actually spreading the PLACE node that dominates them. Choi (1989), based on evidence from Kabardian, suggests that this may in fact be the normal way in which class nodes operate: they define sets of terminal nodes that may spread, but do not actually spread themselves.
The form of rules in the phonetic component is an almost completely unsettled issue. For this reason, and for lack of data, I have stated the phonetic rule of Alveolar Weakening schematically as in (10): (10)
Alveolar Weakening
Depending on rate and casualness of speech, lessen the degree of closure for a COR autosegment, if it is [−continuant] and syllable-final.
In (11) is a sketchy derivation showing how the rule would apply. We start with the output of the phonology applying to underlying /tk/, taken from (9). Next, the phonetic component assigns degree-of-closure targets to the CORONAL and DORSAL autosegments. Notice that the target for the DORSAL autosegment extends across two C positions, since this autosegment has undergone spreading. Finally, the rule of Alveolar Weakening lessens the degree of closure for the CORONAL target. It applies variably, depending on speech style and rate, but I have shown just one possible output.
(11)
a. Output of phonology: as in (9), two C slots whose PLACE nodes dominate COR (first C) and DORS (second C), with DORS linked to both PLACE nodes.
b. Translation into quantitative targets: each autosegment is assigned a degree-of-closure target on a scale from 0 to 1; the DORSAL target extends across both C positions.
c. Alveolar Weakening: the degree of closure for the CORONAL target is lessened.
This analysis is surely incomplete, and indeed it may turn out to be entirely wrong. But it does have the virtue of leading us to questions for further research, especially along the lines of how the rules might be generalized to other contexts. To give one example, I have split up what Nolan treats as a single rule into two distinct processes: a phonological spreading rule, Place Assimilation (8); plus a phonetic rule, Alveolar Weakening (10). This predicts that in principle, one rule might apply in the absence of the other. I believe that this is in fact true. For example, the segment /t/ is often weakened in its articulation even when no other segment follows. In such cases, the weakened /t/ is usually "covered" with a simultaneous glottal closure. Just as in Nolan's data, the degree of weakening is variable, so that with increasingly casual speech we can get a continuum like the following for what: [wʌt], [wʌtʔ], [wʌʔt], [wʌʔ]. It is not clear yet how Alveolar Weakening should be stated in its full generality. One possibility is that the alveolar closures that can be weakened are those that are "covered" by another articulation. A full, accurate formulation of Alveolar Weakening would require a systematic investigation of the behavior of syllable-final alveolars in all contexts.
My analysis also raises a question about whether Nolan is right in claiming that Place Assimilation is a "gradual process." It is clear from his work that the part of the process I have called Alveolar Weakening is gradual. But what about the other part, which we might call "Place Assimilation Proper"? In my analysis, Place Assimilation Proper is predicted to be discrete, since it is carried out by a phonological rule. The tokens that appear in Nolan's paper appear to confirm this prediction, which I think would be worth checking systematically. If the prediction is not confirmed, it is clear that the theory of phonetic representation will need to be enriched in ways I have not touched on here.
Another area that deserves investigation is what happens when the segment that triggers Place Assimilation is itself coronal, as in the dental fricatives in get Thelma, said three, and ten things. Here, it is impossible to form a complex segment, since the trigger and the target are on the same tier. According to my analysis, there are two possible outcomes. If the CORONAL autosegment on the left is deleted, then the output would be a static dental target, extending over both segments, as in (12b). But if there is no delinking, as in (12c), then we would expect the /t/ to become a contour segment, with the tongue blade sliding from alveolar to dental position.
(12)
a. Applying place assimilation to /tθ/: two PLACE nodes, each dominating its own COR node, the first [−distr] (alveolar) and the second [+distr] (dental).
b. Output with delinking: a single [+distr] (dental) specification linked across both segments.
c. Output without delinking: the first consonant's COR node linked to both [−distr] and [+distr], i.e. a contour segment.
My intuitions are that both outcomes are possible in my speech, but it is clear that experimental work is needed. To sum up, I think Nolan's work is important for what it contributes to the eventual development of a substantial theory of phonetic rules. I have also tried to show that by adopting the right phonological representation as
the input to the phonetic component (i.e. an autosegmental one, with articulator nodes), the task of expressing the phonetic rules can be simplified.
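To make the division of labor in this commentary concrete, the sketch below renders the two steps procedurally: a categorical Place Assimilation step (spreading an articulator node onto a syllable-final coronal stop) followed by a gradient Alveolar Weakening step (scaling down the coronal closure target). The dictionary-based segment records, the 0-1 closure scale, and the rate-dependent scaling function are assumptions invented for this illustration, not part of Hayes's formalism.

```python
def place_assimilation(c1, c2):
    """Discrete step: spread the following obstruent's articulator node onto C1."""
    if c1["articulators"] == {"COR"} and c1["syllable_final"] and not c1["continuant"] \
            and c2["obstruent"]:
        c1 = dict(c1, articulators=c1["articulators"] | c2["articulators"])
    return c1

def alveolar_weakening(c1, casualness=0.0):
    """Gradient step: lessen the coronal closure target (1.0 = full closure)."""
    targets = dict(c1.get("closure_targets", {a: 1.0 for a in c1["articulators"]}))
    if "COR" in targets and c1["syllable_final"] and not c1["continuant"]:
        targets["COR"] = round(targets["COR"] * (1.0 - casualness), 2)
    return dict(c1, closure_targets=targets)

# "late calls": syllable-final /t/ before /k/
t = {"articulators": {"COR"}, "syllable_final": True, "continuant": False, "obstruent": True}
k = {"articulators": {"DORS"}, "syllable_final": False, "continuant": False, "obstruent": True}

complex_seg = place_assimilation(t, k)          # corono-dorsal complex segment
for casualness in (0.0, 0.5, 1.0):              # careful ... fully casual
    print(casualness, alveolar_weakening(complex_seg, casualness)["closure_targets"])
```

The point of separating the two functions is exactly the prediction discussed above: the first step is all-or-none, while the second can apply to any degree, or not at all.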
Comments on chapter 10 JOHN J. OHALA Nolan's electropalatographic study of lingual assimilation has given us new insight into the complexities of assimilation, a process which phonologists thought they knew well but which, the more one delves into it, turns out to have completely unexpected aspects. To understand fully place assimilation in heterorganic medial clusters I think it is necessary to be clear about what kind of process assimilation is. I presume all phonologists would acknowledge that variation appears in languages due to "on-line" phonetic processes and due to sound changes. The former may at one extreme be purely the result of vocal-tract constraints; for example, Lindblom (1963) made a convincing case for certain vowel-reduction effects being due to inertial constraints of the vocal mechanism. Sound change, at the other extreme, may leave purely fossilized variant pronunciations in the language, e.g., cow and bovine, both from Proto-Indo-European *gwous. Phonetically caused variation may be continuous and not represent a change in the instructions for pronunciation driving the vocal tract. Sound change, on the other hand, yields discrete variants which appear in speech due to one or the other variant form having different instructions for pronunciation. There are at least two complicating factors which obscure the picture, however. First, it is clear that most sound changes develop out of low-level phonetic variation (Ohala 1974, 1983), so it may be difficult in many cases to differentiate continuous phonetic variation from discrete variation due to sound change. Second, although sound changes may no longer be completely active, they can exhibit varying degrees of productivity if they are extrapolated by speakers to novel lexical items, derivations, phrases, etc. It was a rather ancient sound change which gave us the k > s change evident in skeptic ~ skepticism but this does not prevent some speakers from extending it in novel derivations like domesticism (with a stem-final [s]) (Ohala 1974). I think place assimilation of medial heterorganic clusters in English may very well be present in the language due to a sound change, but one which is potentially much more productive than velar softening. Nevertheless, its full implementation could still be discrete. In other words, I think there may be a huge gap between the faintest version of an alveolar stop in red car and the fully assimilated version [reg ka:]. Naturally, the same phonetic processes which originally gave rise to the 286
sound change can still be found in the language, i.e. imperfect articulation of C1 and thus the weakening of the place cues for that consonant vis-a-vis the place cues for C2. In the first Laboratory Phonology Conference (Ohala 1990a) I presented results of an experiment that showed that artificial heterorganic stop + stop and nasal + stop clusters may be heard as homorganic if the duration of the closures is less than a certain threshold value; in these cases it was the place of C2 which dominated the percept. In that case there was no question of the C1's being imperfectly articulated: rather simply that the place cues for C2 overshadowed those of C1. The misinterpretation of a heterorganic cluster C1C2 as a homorganic one was discrete and, I argued, a duplication of the kind of phonetic event which led to such sound changes as Late Latin okto > Italian otto. I think a similar reading can be given to the results of Nolan's perceptual test (summarized in his fig. 10.4). For the residual-alveolar tokens, slightly less than half of all listeners (linguists and nonlinguists combined) identified the word as having an alveolar C1. I am not saying that this discrete identification change is the discrete phonological process underlying English place assimilation, just that it mirrors the process that gave rise to it.
Another example may clarify my point. No doubt, most would agree that the alternation between /t/ and /tʃ/ as in act and actual ([ækt] ~ [æktʃuəl]) is due to a sound change. (That most speakers do not "derive" actual from act is suggested by their surprise or amusement on being told that the two are historically related.) Nevertheless, we can still find the purely phonetic processes which gave rise to such a change by an examination of the acoustic properties of [t]s released before palatal glides or even palatal vowels: in comparison to the releases before other, more open, vowels, these releases are intense and noisy in a way that mimics a sibilant fricative. No doubt the t → tʃ sound change arose due to listeners misinterpreting and thus rearticulating these noisily released [t]s as sibilant affricates (Ohala, 1989). What we find in a synchronic phonetic analysis may very well be the "seeds" of sound change but it was the past "germination" of such seeds which gave rise to the current discrete alternations in the language; the presence of the affricate in actual is not there due to any change per se being perpetrated by today's speakers or listeners.
Comments on chapter 10
CATHERINE P. BROWMAN
Nolan's study presents articulatory evidence indicating that assimilation is not an all-or-none process. It also presents perceptual evidence indicating that lexical items containing alveolars can be distinguished from control
utterances (with no alveolars) even when there is no electropalatographic evidence for tongue-tip articulations. The study addresses the question of how such articulations differ from the control utterances - an important point. However, it is difficult to evaluate the evidence presented, since the articulation type that is critically important to this type of assimilation - "zero alveolar" - is not clearly defined, largely due to the lack of temporal information.
In figures 10.7a-c, the top panels portray three possible definitions in terms of articulatory gestures (see e.g. Browman and Goldstein 1986, 1989). The top panel in figure 10.7a shows the simplest possibility, and the one suggested by the name "zero alveolar": gestural deletion, effectively the end of the continuum of reduction. However, testing this possibility requires that the behavior of the velar gestures in the control utterances (e.g. ... make calls...) be clearly understood. What is the nature of the velar-velar configuration to which the assimilation is being compared? Are there two partially overlapping velar gestures, as in the top panel of figure 10.7, or have the velar gestures slid so as to be completely superimposed, as in the bottom panel? In the former case, there would be a clear durational difference between zero-alveolar assimilations, defined as deleted gestures, and the control utterances; in the latter, the two would not be clearly distinguishable in terms of duration. The top panel in figure 10.7b portrays a second possible definition of "zero alveolar," one discussed by Nolan: a deleted tongue-tip articulation, with a secondary tongue-front articulation remaining. This case would be clearly distinguishable from both velar-velar configurations. The top panel in figure 10.7c portrays a third possibility, suggested by the notation of autosegmental phonology: the tongue-tip gesture is replaced by a tongue-body gesture (relinking on the place tier). As in the case of gestural deletion (figure 10.7a), this case of gestural replacement can be evaluated only if the nature of the control velar-velar configuration is known. Only if the two velars are completely overlapping would there be an obvious difference between gestural replacement and the control velar-velar configuration.
It should be noted that there is another possible source of perceptual assimilation: increased gestural overlap. This is portrayed in the bottom panels of figures 10.7a-c, which show full articulations for both the tongue tip and the tongue body, but completely overlapping. Zsiga and Byrd (1988) provided a preliminary indication of the validity of a similar analysis of assimilation, using GEST, a computational gestural model being developed at Haskins Laboratories (e.g. Browman et al. 1986; Saltzman et al. 1988a). Zsiga and Byrd created a continuum of increasing overlap between the gestures in the two words in the utterance bed ban, and measured the values of F2 and F3 at the last point before the closure for the [d]. As the bilabial
Figure 10.7 Three possible definitions of "zero-alveolar" in terms of articulatory gestures: (a) deletion of tongue-tip gesture; (b) deletion of tongue-tip gesture, with secondary tongue-front articulation remaining; (c) replacement of tongue-tip gesture by tongue-body gesture
gesture (beginning ban) overlapped the alveolar gesture (ending bed) more and more, the formant values at the measurement point moved away from the values for the utterance bed dan, and towards the values of the utterance beb ban, even prior to complete overlap. Most importantly, when the two gestures were completely synchronous, the formant values were closer to those for [b] than those for [d]. Thus, in this case the acoustic consequences of complete overlap were compatible with the hypothesis that assimilation could be the result of increasing overlap between articulatory gestures. 289
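The notion of a "continuum of increasing overlap" between two gestures can be made concrete with a minimal sketch. The 100 ms gesture durations and the 20 ms sliding steps below are arbitrary choices for the illustration, and no acoustic consequences (the F2/F3 measurements Zsiga and Byrd report) are modeled.

```python
# Sketch: overlap between two articulatory gestures idealized as activation
# intervals (onset, offset) in milliseconds on a shared timeline.
def overlap_fraction(g1, g2):
    """Fraction of gesture g1's duration during which gesture g2 is also active."""
    (a1, b1), (a2, b2) = g1, g2
    shared = max(0.0, min(b1, b2) - max(a1, a2))
    return shared / (b1 - a1)

tongue_tip = (0.0, 100.0)                    # alveolar gesture ending "bed"
for lag in range(100, -1, -20):              # slide the bilabial gesture earlier and earlier
    lips = (float(lag), float(lag) + 100.0)  # bilabial gesture beginning "ban"
    print(f"lag {lag:3d} ms -> overlap {overlap_fraction(tongue_tip, lips):.0%}")
# At lag 0 the two gestures are completely synchronous (100% overlap), the case in
# which the formant values reported above were closer to [b] than to [d].
```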
11
Psychology and the segment
ANNE CUTLER
Something very like the segment must be involved in the mental operations by which human language users speak and understand.* Both processes - production and perception - involve translation between stored mental representations and peripheral processes. The stored representations must be both abstract and discrete. The necessity for abstractness arises from the extreme variability to which speech signals are subject, combined with the finite storage capacity of human memory systems. The problem is perhaps worst on the perceiver's side; it is no exaggeration to say that even two productions of the same utterance by the same speaker speaking on the same occasion at the same rate will not be completely identical. And within-speaker variability is tiny compared to the enormous variability across speakers and across situations. Speakers differ widely in the length and shape of their vocal tracts, as a function of age, sex, and other physical characteristics; productions of a given sound by a large adult male and by a small child have little in common. Situation-specific variations include the speaker's current physiological state; the voice can change when the speaker is tired, for instance, or as a result of temporary changes in vocal-tract shape such as a swollen or anaesthetized mouth, a pipe clenched between the teeth, or a mouthful of food. Other situational variables include distance between speaker and hearer, intervening barriers, and background noise. On top of this there is also the variability due to speakers' accents or dialects; and finally, yet more variability arises due to speech style, or register, and (often related to this) speech rate. But the variability problem also exists in speech production; we all vary our speech style and rate, we can choose to whisper or to shout, and the
* This paper was prepared as an overall commentary on the contributions dealing with segmental representation and assimilation, and was presented in that form at the conference.
accomplished actors among us can mimic accents and dialects and even vocal-tract parameters which are not our own. All such variation means that the peripheral processes of articulation stand in a many-to-one relationship to what is uttered in just the same way as the peripheral processes of perception do. If the lexicon were to store an exact acoustic and articulatory representation for every possible form in which a given lexical unit might be heard or spoken, it would need infinite storage capacity. But our brains simply do not have infinite storage capacity. It is clear, therefore, that the memory representations of language which we engage when we hear or produce speech must be in a relatively abstract (or normalized) form. The necessity for discreteness also arises from the finite storage capacity of our processing systems. Quite apart from the infinite range of situational and speaker-related variables affecting how an utterance is spoken, the set of potential complete utterances themselves is also infinite. A lexicon - that is, the stored set of meaning representations-just cannot include every utterance a language user might some day speak or hear; what is in the lexicon must be discrete units which are smaller than whole utterances. Roughly, but not necessarily exactly, lexical representations will be equivalent to words. Speech production and perception involve a process of translation between these lexical units and the peripheral input and output representations. Whether this process of translation in turn involves a level of representation in terms of discrete sublexical units is an issue which psycholinguists have long debated. Arguments in favor of sublexical representations have been made on the basis of evidence both from perception and from production. In speech perception, it is primarily the problem of segmentation which has motivated the argument that prelexical classification of speech signals into some subword-level representation would be advantageous. Understanding a spoken utterance requires locating in the lexicon the individual discrete lexical units which make up the utterance, but the boundaries between such units - i.e. the boundaries between words-are not reliably signaled in most utterances; continuous speech is just that-continuous. There is no doubt that a sublexical representation would help with this problem, because, instead of being faced with an infinity of points at which a new word might potentially commence, a recognizer can deal with a string of discrete units which offer the possibility of a new word beginning only at those points where a new member of this set of sublexical units begins. Secondly, arguments from speech perception have pointed out that the greatest advantage of a sublexical representation is that the set of potential units can be very much smaller than the set of units in the lexicon. However large and heterogeneous the lexical stock (and adult vocabularies run into 291
many tens if not hundreds of thousands of items), with sublexical representations any lexical item could be decomposed into a selection from a small and finite set of units. Since a translation process between the lexicon and more peripheral processes is necessary in any case, translation into a small set of possibilities will be far easier than translation into a large set of possibilities. Exactly similar arguments have been made for the speech-production process. If all the words in the lexicon can be thought of as being made up of a finite number of building blocks in various permutations, then the translation process from lexical representation to the representation for articulation need only know about how the members of the set of building blocks get articulated, not how all the thousands of entries in the lexicon are spoken. Obvious though these motivating arguments seem, the point they lead to is far from obvious. Disagreement begins when we attempt to answer the next question: what is the nature of the building blocks, i.e. the units of sublexical representation? With excellent reason, the most obvious candidates for the building-block role have been the units of analysis used by linguists. The phoneme has been the most popular choice because (by definition) it is the smallest unit into which speech can be sequentially decomposed. I wish that it were possible to say at this point: the psycholinguistic evidence relating to the segment is unequivocal. Unsurprisingly, however, equivocality reigns here as much as it does in phonology. On the one hand, language users undoubtedly have the ability to manipulate speech at the segmental level, and some researchers have used such performance data as evidence that the phoneme is a level of representation in speech processing. For instance, language games such as Pig Latin frequently involve movement of phoneme-sized units within an utterance (so that in one version of the game, pig latin becomes ig-pay atin-lay). At a less conscious level, slips of the tongue similarly involve movement of phoneme-sized units - "by far the largest percentage of speech errors of all kinds," says Fromkin (1971: 30), involve units of this size: substitution, exchange, anticipation, perseveration, omission, addition - all occur more often with single phonemes than with any other linguistic unit. The on-line study of speech recognition has made great use of the phoneme-monitoring task devised by Foss (1969), which requires listeners to monitor speech for a designated target phoneme and press a response key as soon as the target is detected; listeners have no problem performing this task (although as a caveat it should be pointed out that the task has been commonly used only with listeners who are literate in a language with alphabetic orthography). Foss himself has provided (Foss and Blank 1980; Foss, Harwood and Blank 1980; Foss and Gernsbacher 1983) the strongest recent statements in favor of
the phoneme as a unit of representation in speech processing: "the speech perception mechanisms compute a representation of the input in terms of phoneme-sized units" (Foss, Harwood, and Blank 1980: 185). This argument is based on the fact that phoneme targets can be detected in heard speech prior to contact with lexical representations, as evidenced by the absence of frequency effects and other lexical influences in certain types of phoneme-monitoring experiment. Doubt was for a time cast on the validity of phoneme-monitoring performance as evidence for phonemic representations in processing, because it was reported that listeners can detect syllable-sized targets faster than phoneme-sized targets in the same speech material (for English, Savin and Bever 1970; Foss and Swinney 1973; Mills 1980; Swinney and Prather 1980; for French, Segui, Frauenfelder, and Mehler 1981). However, Norris and Cutler (1988) noted that most studies comparing phoneme- and syllable-monitoring speed had inadvertently allowed the syllable-detection task to be easier than the phoneme-detection task - when the target was a syllable, no nontarget items were presented which were highly similar to the target, but when the target was a phoneme, nontarget items with very similar phonemes did occur. To take an example from Foss and Swinney's (1973) experiment, the list giant valley amoral country private middle bitter protect extra was presented for syllable monitoring with the target bit-, and for phoneme monitoring with the target b-. The list contains no nontarget items such as pitcher, battle, or bicker which would be very similar to the syllable target, but it does contain nontarget items beginning with p-, which is very similar to the phoneme target. In effect, this design flaw allowed listeners to respond in the syllable-target case on the basis of only partial analysis of the input. Norris and Cutler found that when the presence of items similar to targets of either kind was controlled, so that listeners had to perform both phoneme detection and syllable detection on the basis of equally complete analyses of the input, phoneme-detection times were faster than syllable-detection times. Thus there is a substantial body of opinion favoring phonemic units as processing units. On the other hand, there are other psycholinguists who have been reluctant even to consider the phoneme as a candidate for a unit of sublexical representation in production and perception because of the variability problem. To what degree can it be said that acoustic cues to phonemes possess constant, invariant properties which are necessarily present whenever the phoneme is uttered? If there are no such invariant cues, they have argued, how can a phonemic segmentation process possibly contribute to processing efficiency, since surely it would simply prove enormously difficult in its own right? Moreover, at the phoneme level the variability discussed above is further compounded by coarticulation, which makes a phoneme's spoken
form sensitive to the surrounding phonetic context - and the context in question is not limited to immediately adjacent segments, but can include several segments both forwards and backwards in the utterance. This has all added up to what seemed to some researchers like insuperable problems for phonemic representations. As alternatives, both units above the phonemic level, such as syllables, demisyllables, or diphones, and those below it, such as featural representations or spectral templates, have been proposed. (In general, though, nonlinguistic units such as diphones or demisyllables have only been proposed by researchers who are concerned more with machine implementation than with psychological modeling. An exception is Samuel's [1989] recent defense of the demisyllable in speech perception.) The most popular alternative unit has been the syllable (Huggins 1964; Massaro 1972; Mehler 1981; Segui 1984), and there is a good deal of experimental evidence in its favor. Moreover, this evidence is very similar to the evidence which apparently favors the phoneme; thus language games often use syllabically defined rules (Scherzer 1982), slips of the tongue are sensitive to syllabic constraints (MacKay 1972), and on-line studies of speech recognition have shown that listeners can divide speech into syllables (Mehler et al. 1981). Thus there is no unanimity at all in the psycholinguistic literature, with some researchers favoring phonemic representations and some syllabic representations, while others (e.g. Crompton 1982) favor a combination of both, and yet others (e.g. Samuel 1989) opt for some more esoteric alternative. A consensus may be reached only upon the lack of consensus. Recent developments in the field, moreover, have served only to sow further confusion. It turns out that intermediate levels of representation in speech perception can be language-specific, as has been shown by experiments following up the finding of Mehler et al. (1981) that listeners divide speech up into syllables as they hear it. Mehler et al.'s study was carried out in French; in English, as Cutler et al. (1986) subsequently found, its results proved unreplicable. Cutler et al. pointed out that syllable boundaries are relatively clearer in French than in English, and that this difference would make it inherently more likely that using the syllable as a sublexical representation would work better in French than in English. However, they discovered in further experiments that English listeners could not divide speech up into syllables even when they were listening to French, which apparently encourages such a division; while French listeners even divided English up into syllables wherever they could, despite the fact that the English language fails to encourage such division. Thus it appears that the French listeners, having grown up with a language which encourages syllabic segmentation, learnt to use the syllable as an intermediate representation, whereas English listeners, who had grown up with a hard-to-syllabify language, had learnt not to use it.
In other words, speakers' use of intermediate units of representation is determined by their native language. The reason that this finding muddied the theoretical waters is, of course, that it means that the human language-processing system may have available to it a range of sublexical units of representation. In such a case, there can be no warrant for claims that any one candidate sublexical representation is more basic, more "natural" for the human language-processing system, than any other for which comparable evidence may be found. What relevance does this have to phonology (in general, and laboratory phonology in particular)? Rather little, perhaps, and that entirely negative: it suggests, if anything, that the psychological literature is not going to assist at all in providing an answer to the question of the segment's theoretical status in phonology. This orthogonality of existing psycholinguistic research to phonological issues should not, in fact, be surprising. Psychology has concluded that while the units of sublexical representation in language perception and production must in terms of abstractness and discreteness resemble the segment, they may be many and varied in nature, and may differ from language community to language community, and this leaves phonology with no advance at all as far as the theoretical status of the segment is concerned. But a psychological laboratory is not, after all, the place to look for answers to phonological questions. As I have argued elsewhere (Cutler 1987), an experiment can only properly answer the question it is designed to answer, so studies designed to answer questions about sublexical representations in speech processing are ipso facto unlikely to provide answers of relevance to phonology. When the question is phonological, it is in the phonology laboratory that the answer is more likely to be found.
12
Trading relations in the perception of stops and their implications for a phonological theory
LIESELOTTE SCHIEFER
12.1 Introduction
All feature sets used in the description of various languages are either phonetically or phonemically motivated. The phonetically motivated features are based on acoustic, articulatory, or physiological facts, whereas phonemically motivated features take into account, for example, the comparability of speech sounds with respect to certain phonemic and/or phonotactic rules. But even if the authors of feature sets agree on the importance of having phonetically adequate features, they disagree on the selection of features to be used in the description of a given speech sample. This is the situation with the description of Hindi stops. Hindi has a complicated system of four stop classes (voiceless unaspirated, voiceless aspirated, voiced, and breathy voiced, traditionally called voiced aspirated) in four places of articulation (labial, dental, retroflex, and velar), plus a full set of four affricates in the palatal region. Since Chomsky and Halle (1968), several feature sets have been put forward in order to account for the complexity of the Hindi stop system (Halle and Stevens 1971; Ladefoged 1971; M. Ohala 1979; Schiefer 1984). The feature sets proposed by these authors have in common that they make use only of physiologically motivated features such as "slack vocal cords" (see table 12.1). In what follows, we concentrate on the features proposed by Ladefoged, Ohala, and Schiefer. These authors differ not only according to their feature sets but also with respect to the way these features are applied to the stop classes. Ladefoged groups together (a) the voiceless unaspirated and voiceless aspirated stops by assigning them the value "0" of the feature "glottal stricture," and (b) the breathy voiced and voiceless aspirated stops by giving them the value "2" of "voice-onset time." But he does not consider voiced
Table 12.1 Feature sets proposed by Chomsky and Halle (1968), Halle and Stevens (1971), M. Ohala (1979), and Schiefer (1984)

Chomsky and Halle (1968): heightened subglottal air pressure, voice, tense
Halle and Stevens (1971): spread glottis, constricted glottis, stiff cords, slack cords
Ladefoged (1971): glottal stricture, voice-onset time
M. Ohala (1979): distinctive release, delayed release, voice-onset time, glottal stricture, vocal-cord tension
Schiefer (1984): distinctive release, delayed release, voice-onset time, vocal-cord tension
and breathy voiced stops to form a natural class, since he assigns "0" for the feature "voice-onset time" to the voiced stop. Ohala uses four features (plus "delayed release"). The feature "distinctive release" is used to distinguish the "nonaspirates," voiced and voiceless unaspirated, from voiceless aspirated, breathy voiced, and the affricates. The feature "voice-onset time" is used to build a natural class with the voiced and breathy voiced stops, whereas "glottal stricture" allows her both to build a class with voiceless unaspirated and voiceless aspirated stops, and to account for the difference in the mode of vibration between voiced and breathy voiced stops. Finally, she needs the feature "vocal-cord tension" "to distinguish the voiced aspirates from all other stop types also," as "It is obvious that where the voiced aspirate stop in Punjabi has become deaspirated, a low rising tone on the following vowel has developed" (1979: 80).
Schiefer (1984) used the features "distinctive release," "delayed release," and "vocal-cord tension" in the same way as Ohala did, but followed Ladefoged in the use of the feature "voice-onset time." She rejected the feature "glottal stricture" on grounds which will be discussed later. It is apparent that Ohala's analysis is based on phonetic as well as phonemic arguments. She uses phonetic arguments in order to reject the feature "heightened subglottal air pressure" of Chomsky and Halle (1968) and to favor the features "glottal stricture" and "voice-onset time," which she takes over from Ladefoged (1971). On the other hand, she makes use of a phonemic argument in favor of the feature "delayed release," as "in Maithili there is a rule which involves the de-aspiration of aspirates when they occur in non-utterance-initial syllables, followed by a short vowel and either a voiceless aspirate or a voiceless fricative" (1979: 80). Moreover, in contrast to Ladefoged and Schiefer, Ohala assigns "onset of voicing" to the feature "voice-onset time." This implies that the feature "voice-onset time" is applied to two different acoustic (and physiological) portions of the breathy voiced stop. That is, since Ohala treats the breathy voiced stop in the same way as the voiced one, her feature applies to the prevoicing of the stop, whereas Ladefoged and Schiefer apply the feature to the release of the stop. As already mentioned, Ladefoged, Ohala, and Schiefer all use physiological features, and their investigations are based on relatively small amounts of physiological or acoustic data. None of these authors relies on perceptual results - something which is characteristic of other phonological work as well (see Anderson 1978). The aim of the present paper is therefore to use both an acoustic analysis (based on the productions of four informants) and results from perceptual tests as a source of evidence for specific features. Since I lack physiological data of my own, I will concentrate especially on the feature "voice-onset time." In doing so, I start from the following considerations. (a) If Ohala is right in grouping the voiced and breathy voiced stops together into one natural class by the feature "voice-onset time," then this phonetic feature, namely prevoicing, is a necessary feature in the production as well as in the perception of these stops. Otherwise the function of prevoicing differs between the two stop classes. (b) If the main acoustic (as well as perceptual) information about a breathy voiced stop is located in the release portion, the influence of prevoicing in the perception of this stop should be less important.

12.2 Acoustic properties of Hindi stops
12.2.1 Material and informants

The material consisted of 150 different Hindi words containing the breathy voiced stops /bh, dh, dh, gh/ in word-initial position followed by the vowels /a,
e, o, i, u/. Each stop-vowel combination occurred about ten times in the material. The data were organized in several lists and read in citation form. The first and last member of each list was excluded from analysis in order to control for list effects. Due to this procedure and some mispronunciations (i.e. reading [dh] instead of [gh]) as well as repetitions caused by mistakes, the number of examples which could be used in the acoustic analysis differed between subjects. Four native speakers of Hindi served as informants: S.W.A. (female, 35 years), born in Simla (Himachal Pradesh), raised in Simla and New Delhi; M.A.N. (female, 23 years), born in Piprai (Uttar Pradesh); R.P.J. (male, 40 years), born in New Delhi, and P.U.N. (male, 22 years), born in Mirzapur (Uttar Pradesh). The informants thus come from different dialect areas of Hindi, but it is usually assumed that those dialects belong to the western Hindi dialect group (see Mehrota 1980). All informants speak English and German fluently. Except for S.W.A., who speaks Punjabi as well, none of the informants reported a thorough knowledge of any other Indian language. S.W.A. was taped in the sound-proofed room of our institute on a Telefunken M15 tape recorder using a Neumann U87 microphone. The recordings of M.A.N., R.P.J., and P.U.N. were made in the language laboratory of the Centre of German Studies, School of Languages of the Jawaharlal Nehru University in New Delhi using a Uher Report 4002 and a Sennheiser microphone. The microphone was placed in front of the informants at a distance of 50 cm. The material was digitized on a PDP11/50, filtered at 8 kHz, and segmented into single acoustic portions from which the duration values were calculated (see for detail Schiefer 1986, 1988, 1989).

12.2.2 Results

Only those results which are relevant for this paper will be presented here; further details can be found in Schiefer (1986, 1987, 1988, 1989). Hindi breathy voiced stops can be defined as consisting of three acoustic portions: the prevoicing during the stop closure ("voicing lead"), the release of the stop or the burst, and a breathy voiced vowel following the burst, traditionally called "voiced aspiration." This pattern can be modified under various conditions: the voicing lead can be missing, and/or the breathy voiced vowel can be replaced by a voiceless one (or voiceless aspiration). From this it follows that four different realizations of Hindi breathy voiced stops occur: the "lead" type, which is the regular one, having voicing during the stop closure; and the "lag" type, which lacks the voicing lead. Both types have two different subtypes: the burst can be followed either by a breathy voiced vowel or by voiceless aspiration, which might be called the "voiceless" type of breathy voiced stop. Figure 12.1 displays three typical examples from R.P.J. Figures 12.1a
Figure 12.1 Oscillograms of three realizations of breathy voiced stops from R.P.J.: (a) /bhogna/, (b) /bhav/, (c) /ghiya/
(/bhogna/) and 12.1b (/bhav/) represent the "lead" type of stop having regular prevoicing throughout the stop closure, a more (fig. 12.1b) or less (fig. 12.1a) pronounced burst, and a breathy voiced vowel portion following the burst. Note that, notwithstanding the difference in the degree of "aspiration" (to borrow the term preferred by Dixit 1987) between both examples, the vocal cords remain vibrating throughout. Figure 12.1c (/ghiya/) gives an example of the "voiceless" type. Here again the vocal cords are vibrating throughout the stop closure. But unlike the regular "lead" type, the burst is followed by a period of voiceless instead of voiced aspiration.1 The actual realization of the stop depends on the speaker and on articulatory facts. The "lag" type of stop is both speaker-dependent and articulatorily motivated, whereas the "voiceless" type is solely articulatorily determined.

1 The results just mentioned caused me in a previous paper (Schiefer 1984) to reject the feature "glottal stricture", as it is applicable to the regular type of the breathy voiced stop only.
Table 12.2 Occurrence of the voicing lead in breathy voiced stops (in percent)

(a) P.U.N.
               a      e      i      o      u      mean
Labial        100     93     87     82     71      87
Dental         85     -      93    100     94      93
Retroflex     100    100     80    100    100      96
Velar         100    100    100     88     92      96
mean           96     97     90     92     89

(b) S.W.A.
               a      e      i      o      u      mean
Labial         60     78     78     96     70      76
Dental         85     -     100    100     95      95
Retroflex      91    100     93    100    100      96
Velar          93    100    100     90     90      94
mean           82     92     93     96     89
My informants differed extremely with respect to the voicing lead. Out of four informants only two had overwhelmingly "lead" realizations: M.A.N., who produced "lead" stops throughout, and R.P.J., in whose productions the voicing lead was absent only twice. The data for the other two informants, P.U.N. and S.W.A., are presented in table 12.2 for the different stop-vowel sequences, as well as the mean values for the places of articulation and the vowels. The values are rounded to the nearest integer. P.U.N. and especially S.W.A. show a severe influence of place of articulation and vowel on the voicing lead, which is omitted especially in the labial place of articulation by both of them and when followed by the vowel /a/ in S.W.A.'s productions. This is interesting, as it is usually believed that prevoicing is most easily sustained in these two phonetic conditions (Ohala and Riordan 1980). The "voiceless" type usually occurs in the velar place of articulation and/or before high vowels, especially before /i/ (see for detail Schiefer 1984). The average duration values for the acoustic portions for all informants are given in figure 12.2 for the different places of articulation. It is interesting to note that, notwithstanding the articulatorily conditioned differences within the single acoustic portions, the duration of all portions together (i.e. voicing lead + burst + breathy voiced vowel) differs only minimally, except the values for /bh/ in S.W.A.'s productions. This points to a tendency in all subjects to keep the overall duration of all portions the same or nearly the same in all environments (see for detail Schiefer 1989). As the words were read in citation form and not within a sentence frame, it is uncertain whether these results mirror a general articulatory behavior of the subjects or whether they are artifacts of the recording procedure.
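The compensation described here - articulatorily conditioned differences in the individual portions, but near-constant totals - can be checked directly from segmented duration values. The sketch below is a minimal illustration of that check and is not the analysis procedure used in the study; the token durations in it are invented placeholders, not values from the data.

```python
# Minimal sketch: compare variability of the individual portions with that of their sum.
# Durations are in msec.; the numbers below are invented placeholders, not data from the study.
from statistics import mean, stdev

# each token: (voicing lead, burst, breathy voiced vowel)
tokens = [
    (80.0, 5.0, 95.0),   # a token with a long lead and a moderate breathy portion
    (60.0, 6.0, 115.0),  # shorter lead compensated by a longer breathy portion
    (0.0, 7.0, 170.0),   # "lag" realization: no lead at all
]

portions = list(zip(*tokens))          # lead, burst, breathy as separate series
totals = [sum(t) for t in tokens]      # lead + burst + breathy per token

for name, series in zip(("lead", "burst", "breathy"), portions):
    print(f"{name:8s} mean {mean(series):6.1f}  sd {stdev(series):5.1f}")
print(f"{'total':8s} mean {mean(totals):6.1f}  sd {stdev(totals):5.1f}")
# A much smaller sd for the total than for the lead or the breathy portion alone
# is what the compensation tendency would look like in data of this kind.
```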
Figure 12.2 Durations of voicing lead (VLD), burst, and breathy voiced vowel (BRED)

Figure 12.3 (a) Spectrogram of /bhalu/ - the beginning of the breathy voiced vowel portion is marked by a cursor; (b) oscillogram of the vowel immediately following the cursor position in (a); (c) power spectrum, calculated over approximately 67.5 msec. from the cursor position to the right
12 Lieselotte Schiefer
(a)
(b)
(c).
Figure 12.4 The beginning of the steady vowel is marked by a cursor in (a); (b): oscillogram from cursor to the right; (c): power spectrum calculated over 67.5 msec, of the clear vowel
12.3 Perception of Hindi stops
12.3.1 Experiment 1

12.3.1.1 Material, method, and subjects

A naturally produced /bhalu/ (voicing lead = 79.55 msec., burst = 5.2 msec., breathy voiced vowel = 94.00 msec.) produced by S.W.A. was selected as point of departure for the manipulation of the test items. This item was used for several reasons: (a) the item belongs to the "lead" type, having prevoicing throughout the stop closure (see fig. 12.3a and b) and a breathy voiced vowel portion following the stop release; (b) there is only minimal degrading of the voicing lead towards the end of the closure; (c) the prevoicing is of a sufficient duration; (d) the breathy voiced portion is long enough to allow the generation of seven different test items; (e) the degree of "aspiration" is less; and (f) there is a remarkable difference between the amplitude of the fundamental and that of the second harmonic in the breathy voiced vowel portion (see fig. 12.3c) compared with the steady one (see fig. 12.4c), which is one of the most efficient acoustic features in the perception of breathy voiced phonation (Bickley 1982; Ladefoged and Antonanzas-Baroso 1985; Schiefer 1988).

The method used for manipulation was that of speech-editing. The first syllable was separated from the rest of the word in order to avoid uncontrollable influences from context. The syllable was delimited before the transition into /l/ was audible, and cut into single pitch periods. The first pitch period, which showed clear frication, was defined as "burst" and was not subjected to manipulation. The breathy voiced portion of the vowel was separated from the clear vowel portion by inspection of the oscillogram combined with an auditory check. The boundary between both portions is marked in figure 12.4a. Note that the difference between the amplitude of the fundamental and that of the second harmonic in the clear vowel is small and thus resembles that of modal voice (fig. 12.4c). The following portion of eighteen pitch periods was divided into three parts containing six pitch periods each. A so-called "basic continuum," consisting of seven stimuli, was generated by reducing the duration of the breathy voiced vowel portion in steps of three pitch periods (approximately 15 msec.). The pitch periods were chosen from the three subportions of the breathy voiced vowel by applying the scheme shown in table 12.3.

Table 12.3 Scheme for the reduction of the breathy voiced portion ("PP" = pitch period; "-" = pitch periods deleted; "+" = remaining pitch periods)

Stimulus 1    original; PP1-PP18
Stimulus 2    - (PP1, PP8, PP15)
Stimulus 3    - (PP1, PP4, PP8, PP11, PP15, PP18)
Stimulus 4    - (PP1, PP3, PP5, PP7, PP9, PP11, PP13, PP15, PP17)
Stimulus 5    - (PP1, PP2, PP4, PP6, PP7, PP8, PP10, PP12, PP13, PP14, PP16, PP18)
Stimulus 6    + (PP6, PP12, PP18)
Stimulus 7    none

From this basic continuum eight tests were derived, where, in addition to the manipulation of the breathy voiced portion, the duration of the voicing lead was reduced by approximately 10 msec. (two pitch periods) each. Tests 1-8 thus covered the range of 79.55 to 0 msec. voicing lead. The continua were tested in an identification task in which every stimulus was repeated five times and presented in randomized order. All stimuli were followed by a pause of 3.5 sec., while blocks of ten stimuli were separated by 10 sec. Answer sheets (in ordinary Hindi script) required listeners to choose whether each stimulus was voiceless unaspirated, voiceless aspirated, voiced, or breathy voiced (forced choice).
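To make the editing scheme concrete, the sketch below builds the seven-step continuum and the eight voicing-lead conditions from already-segmented material. It is an illustrative reconstruction of the logic of table 12.3 only, not the software actually used in the study; the sampling rate, the placeholder waveforms, and the per-period durations are assumptions for the example.

```python
# Illustrative sketch of the stimulus construction; assumes the utterance has already been
# segmented into its acoustic portions and pitch periods. Not the original editing software.
import numpy as np

FS = 16000  # assumed sampling rate (Hz); not stated in the text

# Placeholder waveforms standing in for the real segmented signal.
voicing_lead = np.zeros(int(FS * 0.07955))                      # 79.55 msec. of prevoicing
burst = np.zeros(int(FS * 0.0052))                              # 5.2 msec. burst, never manipulated
breathy_pps = [np.zeros(int(FS * 0.0052)) for _ in range(18)]   # PP1-PP18 of the breathy vowel
clear_vowel = np.zeros(int(FS * 0.100))                         # steady (modal) vowel portion

# Table 12.3: pitch periods deleted (1-based indices) for each stimulus of the basic continuum.
deleted = {
    1: set(),
    2: {1, 8, 15},
    3: {1, 4, 8, 11, 15, 18},
    4: {1, 3, 5, 7, 9, 11, 13, 15, 17},
    5: {1, 2, 4, 6, 7, 8, 10, 12, 13, 14, 16, 18},
    6: set(range(1, 19)) - {6, 12, 18},     # only PP6, PP12, PP18 remain
    7: set(range(1, 19)),                   # breathy portion removed entirely
}

# Voicing-lead durations (msec.) for tests 1-8, as reported in the text.
vld_ms = [79.55, 58.65, 50.80, 41.90, 32.90, 22.50, 10.40, 0.0]

def build_stimulus(test, stim):
    """Shortened voicing lead + burst + reduced breathy portion + clear vowel."""
    n_lead = int(FS * vld_ms[test - 1] / 1000)
    lead = voicing_lead[-n_lead:] if n_lead > 0 else np.zeros(0)
    breathy = [pp for i, pp in enumerate(breathy_pps, start=1) if i not in deleted[stim]]
    return np.concatenate([lead, burst] + breathy + [clear_vowel])

continua = {t: [build_stimulus(t, s) for s in range(1, 8)] for t in range(1, 9)}
```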
All tests were carried out in the Telefunken language laboratory of the Centre of German Studies, School of Languages, of the Jawaharlal Nehru University in New Delhi and were presented over headphones at a comfortable level. The twelve subjects were either students or staff members of the Centre. They were paid for their participation (see for details Schiefer 1986).

12.3.1.2 Results

The results are plotted separately for the different response categories in figures 12.5-12.8. The ordinate displays the identification ratios in percent, the abscissa the duration of burst and breathy voiced vowel portion in milliseconds for the single stimuli. Figure 12.5 shows that stimuli 1-4 elicit breathy voiced responses in tests 1-5 (voicing lead (VLD) = 79.55 msec. to 32.9 msec.). In test 6 (VLD = 22.5 msec.) only the first three stimuli of the continuum elicited breathy voiced answers, whereas in tests 7 (VLD = 10.4 msec.) and 8 (no VLD) none of the stimuli was identified unambiguously as breathy voiced. Thus the shortening of the voicing lead does not affect the breathy voiced responses until the duration of this portion drops below 20 msec. in duration. Stimuli 5-7 of tests 1-4 (VLD = 79.55 msec. to 41.9 msec.) were judged as voiced (see fig. 12.6). In test 5 (VLD = 32.9 msec.), on the other hand, only stimulus 6 was unambiguously perceived as voiced; the responses to stimulus 7 were at chance level. In tests 6-8 no stimulus was perceived as voiced. This means that the lower limit for the perception of a voiced stop lies at 32.9 msec. voicing lead. If the duration of the voicing lead drops below that value, voiceless unaspirated responses are given, as shown in figure 12.7. In comparing the perceptual results for the voiced and breathy voiced category it is obvious that voiced responses require a longer prevoicing than breathy voiced ones. The shortening of both portions, voicing lead and breathy voiced, leads to the perception of a voiceless aspirated stop (fig. 12.8). The perception of voiceless aspirated stops is the most interesting outcome of this experiment. One may argue that the perception of stop categories (like voiced and aspirated) simply depends on the perceptibility of the voicing lead and the amount of frication noise following the release. If this were true, voiceless aspirated responses should have been given to stimuli 1-4 in tests 6-8, since in these tests the stimuli with a short breathy voiced portion, stimuli 5-7, were judged as voiceless unaspirated, which implies that the voicing lead was not perceptible. But it is obvious that at least in test 6 breathy voiced instead of voiceless aspirated responses were elicited. In all tests, the voiceless aspirated function reaches its maximum in the center of the continuum (stimulus 4), not at the beginning. On the other hand it seems that the perception of a voiceless aspirated stop
cannot be explained by the acoustic content of the stimulus itself. In order to achieve a shortening of the breathy voiced portion the eliminated pitch periods were taken from different parts within the breathy voiced vowel portion. This means that the degree of frication, which is highest immediately after the oral release, degrades as the breathy voiced portion shortens.

Figure 12.5 Percent breathy responses from experiment 1
Figure 12.6 Percent voiced responses from experiment 1
Figure 12.7 Percent voiceless unaspirated responses from experiment 1
Figure 12.8 Percent voiceless aspirated responses from experiment 1
12.3.2 Experiment 2

In a second experiment we tried to replicate the results of the first one by using a rather irregular example of a bilabial breathy voiced stop for manipulation. The original stimulus (/bhola/) consisted of voicing lead (92.75
msec.), a burst (11.1 msec.), a period of voiceless aspiration (21.9 msec.), and a breathy voiced portion (119.9 msec.) followed by the clear vowel. It should be mentioned that the degree of aspiration in the breathy voiced portion was less than in the first experiment. The test stimuli were derived from the original one by deleting the period of voiceless aspiration for all stimuli and otherwise following the same procedure as described in experiment 1. Thus, 8 test stimuli were obtained, forming the basic continuum. A series of five tests was generated from the basic continuum by reducing the duration of the voicing lead from 92.75 msec. (test 1) to 37.4 msec. (test 2), and then eliminating two pitch periods each for the remaining three tests. The same testing procedure was applied as in experiment 1. The results for tests 1-3 resemble those of experiment 1. The continuum is divided into two categories: breathy voiced responses are given to stimuli 1-5, and voiced ones to stimuli 6-8 (see fig. 12.9). Only stimulus 8 cannot be unambiguously assigned to the voiced or voiceless unaspirated category in test 3. Tests 4 and 5 produced comparable results to tests 7 and 8 of experiment 1: there is an increase in voiceless aspirated responses for stimuli 4-6 (see fig. 12.10). In comparing the results from the two experiments, it is obvious that the main difference between them concerns stimuli 1-3 in those tests in which the duration of the voicing lead drops below about 20 msec. (tests 7 and 8 in experiment 1 and tests 4 and 5 in experiment 2). Whereas in experiment 1 these stimuli are ambiguous, they are clearly identified as breathy voiced in experiment 2. This result is directly connected with the "acoustic content" of the stimuli, i.e. the greater amount of "aspiration" in experiment 1 and a lesser one in experiment 2. On the other hand the experiments are comparable as to the rise of voiceless aspirated answers in the center of the continuum, which cannot be explained by the acoustic content, as the degree of aspiration degrades with the reduction of the breathy voiced portion. This result can be explained when the duration of the unshortened breathy voiced portion is taken into account: it appears that it exceeds that of the first experiment by about 40 msec.

12.4 Discussion
12.4.1 Summary of findings

The acoustic analysis of Hindi breathy voiced stops has revealed two main findings. First, the realization of the stop release depends to a high degree on articulatory constraints such as velar place of articulation or high vowel. Second, the realization of this stop category depends on subject-specific and on articulatory facts. Two out of four subjects produced this stop class with prevoicing (only two exceptions), whereas in the data of P.U.N. and S.W.A. the prevoicing is missing especially in the labial place of articulation and before the vowel /a/ in up to 40 percent of the productions.2
2 Comparable results are reported by Poon and Mateer (1985) for Nepali, another Indo-Aryan language; seven out of ten informants lacked the prevoicing. Langmeier et al. (1987) found in the data of one speaker of Gujarati (from Ahmedabad) that prevoicing was absent from about 50 percent of the productions.

Figure 12.9 Percent breathy responses from experiment 2
Figure 12.10 Percent voiceless aspirated responses from experiment 2
These acoustic results neither support nor refute M. Ohala's (1979) view that prevoicing is a relevant feature of Hindi breathy voiced stops. The results from the perception tests are even more difficult to interpret. Several outcomes have to be discussed in detail. There is a clear tendency to divide the continua into two stop classes, breathy voiced and voiced, if the prevoicing is of sufficient duration. If the duration drops below 32.9 msec. (experiment 1) and the breathy voiced portion is reduced to about 40 msec., a voiceless unaspirated instead of a voiced stop is perceived. On the other hand, a breathy voiced stop is heard as long as the duration of the voicing lead does not drop below 22.5 msec. and that of the breathy voiced vowel does not become shorter than about 30-40 msec. Otherwise, a voiceless aspirated stop is perceived. These results point to some main differences in the perception of voiced and breathy voiced stops: (a) Hindi stops are perceived as voiced only if they are produced with voicing lead of a sufficient duration (about 30 msec.). There are no trading relations between the voicing lead and the duration of the breathy voiced vowel portion. (b) Breathy voiced stops are perceived even if the duration of the voicing lead approaches the threshold of perceptibility, as can be concluded from the perception of either voiced or voiceless aspirated stops. If the voicing lead is eliminated totally the responses to the stimuli depend on the duration of the breathy voiced portion: if it is of moderate duration only (79.55 msec., experiment 1) the first two stimuli of the continuum cannot be unambiguously identified; if it is long, as in experiment 2 (131 msec.), the stimuli are judged as breathy voiced. When both the breathy voiced vowel and the voicing lead are short, we find voiceless aspirated responses. These results provide evidence that the voicing lead is less important in the perception of breathy voiced stops than of voiced ones, and that trading relations exist between the duration of the voicing lead and that of the breathy voiced vowel portion. Thus the perception of a breathy voiced stop does not depend solely either on the perceptibility of the voicing lead or on the duration of the breathy voiced vowel portion. It is rather subject to the overall duration of voicing lead + burst + breathy voiced vowel portion. If the duration of this portion drops below a given value the perceived stop category changes. From these puzzling results we can conclude that neither the perception of breathy voiced stops nor of voiceless aspirated stops can be explained solely by the acoustic content of the stimuli. Listeners seem to perceive the underlying glottal gesture of the stimuli, which they clearly separate from a devoicing gesture: they respond with breathy voiced if the gesture exceeds about 60 msec., whereas they respond with voiceless aspirated if the duration
drops below that limit. This hypothesis is directly supported by results from physiological investigations, where it was shown that the duration of the glottal gesture for a breathy voiced stop is about double that of a voiceless aspirated one (Benguerel and Bhatia 1980). Finally, let us turn to the interpretation of the feature "voice-onset time." It must be stated that our results support neither Ohala's concept nor that of Ladefoged and Schiefer, since what is important in the perception of breathy voiced and voiceless aspirated stops is not the voicing lead by itself but the trading relation with the breathy voiced portion. On the other hand, from our results it can be concluded that the Hindi stops form three natural classes. (a) The voiced and voiceless unaspirated stops form one class, as they are perceived only if the duration of burst and breathy voiced portion is shorter than 30-40 msec., i.e. if the burst is immediately followed by a regularly voiced vowel. This result is comparable with those from voice-onset-time experiments, which showed that the duration of the voicing lag in voiceless unaspirated stops rarely exceeds 30 msec. (b) Breathy voiced and voiceless aspirated stops form one class according to the release portion of the stop, which is either a breathy voiced or voiceless vowel, whose duration has to be longer than about 44 msec. in breathy voiced stops and has to be 32-65 msec. in voiceless aspirated stops. Both stops show trading relations between the voicing lead and the breathy voiced portion. (c) Voiceless unaspirated and voiceless aspirated stops can be grouped together with regard to their voicing lead duration which has to be shorter than 32.9 msec. in both stops. (d) Obviously, voiced and breathy voiced stops do not form a natural class: voiced stops need a longer voicing lead than breathy voiced ones (32.9 vs. 22.5 msec.) and voiced stops do not show any trading relations between voicing lead and the breathy voiced portion, whereas breathy voiced stops do. All the results of the experiments can be summarized in two main points: we must distinguish between two acoustic portions, the stop closure and the stop release; and the perception of stops depends on the duration of the whole glottal gesture underlying the production of the stop.
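One way to see how these duration criteria and the trading relation fit together is to cast them as an explicit decision rule. The sketch below is only one interpretation of the thresholds reported above, not a model proposed in the chapter, and the exact cut-offs should be read as approximate.

```python
# A rough decision rule distilled from the perceptual thresholds reported above.
# The cut-offs are approximations taken from the text; treat them as illustrative only.
def perceived_category(voicing_lead_ms, burst_ms, breathy_ms):
    release = burst_ms + breathy_ms          # duration of the release portion
    gesture = voicing_lead_ms + release      # overall "glottal gesture" duration

    if release < 35:                         # short release (roughly 30-40 msec.): plain stops
        # voiced needs a sufficiently long voicing lead; otherwise voiceless unaspirated
        return "voiced" if voicing_lead_ms >= 33 else "voiceless unaspirated"
    # long release: breathy voiced vs. voiceless aspirated is carried by the whole gesture,
    # so a short lead can be traded against a long breathy voiced portion
    return "breathy voiced" if gesture >= 60 else "voiceless aspirated"

# e.g. perceived_category(10, 5, 80) -> "breathy voiced" despite the near-absent lead,
# while perceived_category(10, 5, 20) -> "voiceless unaspirated"
```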
12.4.2 Feature representation

Let us now consider the representation of the Hindi stop system within the framework of distinctive features. Since we consider both stop portions, closure and release, as important for the perception of the stops we have to assign one (multivalued) feature to each phase (see table 12.4). We propose the feature "lead onset time" to account for the differences in the stop closure and "onset of regular voicing" for the differences in the stop release. According to the feature "lead onset time" we assign "0" to the voiceless unaspirated and voiceless aspirated stops, "2" to the voiced and "1" to the
Table 12.4 Feature specification proposed for the description of Hindi stops

                              p      ph     b      bh
Lead onset time               0      0      2      1
Onset of regular voicing      0      2      0      2
breathy voiced stop. In assigning "1" to the breathy voiced stop we take account of the fact that the voicing lead of breathy voiced stops is shorter than that of voiced ones (cf. Schiefer 1988) or may even be missing altogether and that the voicing lead is less important in the perception of breathy voiced stops than in voiced ones. The way we apply the feature "onset of regular voicing" to the stops differs from Ladefoged, Ohala, or Schiefer (1984), as we now group together the voiceless unaspirated and the voiced stops by assigning them the same value, namely "0". In doing this we account for the similarity in the perception of both stops: they are perceived if the duration of the breathy voiced portion is shorter than about 30 msec. In assigning "2" of the feature "onset of regular voicing" to the voiceless aspirated and breathy voiced stops we take into consideration the similarity of these stops in the perception experiments. On the other hand, we do not specify the acoustic nature of this portion. Thus, this portion may be characterized by either a voiceless or a breathy voiced vowel portion in the case of the breathy voiced stop, i.e. may represent a regular or a "voiceless" type of the breathy voiced stops. Therefore, our feature specification is not restricted to the representation of the regular type of the stop but is applicable to all stop types mentioned above. The present feature set allows us to group the stops together in natural classes by (a) assigning the same value of the feature "lead onset time," "0," to the voiceless unaspirated and voiceless aspirated stops; (b) assigning the same value, "0," to the voiceless unaspirated and voiced stops; and (c) assigning "2" of the feature "onset of regular voicing" to the voiceless aspirated and breathy voiced stops. In summary: we have used the results of acoustical and perceptual analysis, as well as the comparison of these results, to set up a feature specification for the Hindi stops which is based on purely phonetic grounds. Thus we are able to avoid a mixture of phonetic and phonemic arguments or to rely on evidence from other languages. In particular, we have shown that the addition of perceptual results helps us to form natural stop classes which seem to be psychologically real. It is to be hoped that these results will
encourage phonologists not only to rely on articulatory, acoustic, and physiological facts and experiments but to take into account perceptual results as well.
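To make the natural-class claims of table 12.4 easy to inspect, the proposed specification can be restated as a small matrix and the classes read off as the sets of stops that share a value for a feature. The snippet below is just such a restatement of the table, not an addition to the analysis.

```python
# The feature specification of table 12.4, restated as data so the natural classes
# (sets of stops sharing a value for some feature) can be read off mechanically.
features = {
    #        (lead onset time, onset of regular voicing)
    "p":  (0, 0),
    "ph": (0, 2),
    "b":  (2, 0),
    "bh": (1, 2),
}
names = ("lead onset time", "onset of regular voicing")

classes = {}
for stop, values in features.items():
    for feature, value in zip(names, values):
        classes.setdefault((feature, value), []).append(stop)

for (feature, value), stops in sorted(classes.items()):
    if len(stops) > 1:
        print(f"{feature} = {value}: {stops}")
# -> lead onset time = 0: ['p', 'ph']
# -> onset of regular voicing = 0: ['p', 'b']
# -> onset of regular voicing = 2: ['ph', 'bh']
# Note that /b/ and /bh/ share no value, i.e. they form no class under this proposal.
```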
Comments on chapter 12

ELISABETH SELKIRK

The Hindi stop series includes voiceless and voiced stops, e.g. /p/, /b/, and what have been typically described (see e.g. Bloomfield 1933) as the aspirated counterparts of these, /ph/ and /bh/. The transcription of the fourway contrast in the Indo-Aryan stop system, as well as the descriptive terminology accompanying it, amounts to a "naive" theory of the phonological features involved in representing the distinctions, and hence of the natural class into which the sounds fall. The presence/absence of the diacritic h can be taken to stand for the presence/absence of a feature of "aspiration," and the b/p contrast taken to indicate the presence/absence of a feature of "voice." In other words, according to the naive theory there are two features which cross-classify and together represent the distinctions among the Hindi stops, as shown in (1).
(1) Naive theory of the Hindi stop features
                      /p/    /ph/    /b/    /bh/
    "Aspiration"       -      +       -      +
    "Voice"            -      -       +      +
For clarity's sake I have used "+" to indicate a positive specification for the features and "-" to indicate the absence of a positive specification, though the naive theory presumably makes no commitment on the means of representing the binary contrast for each feature. The object of Schiefer's study is to provide acoustic and perceptual evidence bearing on the featural classification of the Hindi stops. Schiefer interprets "aspiration" as a delay in the onset of normal voicing in vowels, joining Catford (1977) in this characterization of the feature. "Voice" for Schiefer corresponds to a glottal gesture (see Browman and Goldstein 1986) producing voicing which typically precedes the release of the stop, and which may carry over into the postrelease "aspirated" portion in the case of voiced aspirated stops. The name given to the feature - "lead onset time" - is perhaps unfortunate in failing to indicate that what is at issue is the phasing of the "voice" gesture before, through, and after the release of the stop. Indeed, the most interesting result of Schiefer's study concerns the phasing of this voicing gesture. In plain unaspirated voiced stops voicing always
precedes the stop release. In aspirated voiced stops, by contrast, there is a considerable variability amongst speakers in whether voicing precedes the release at all, and by how much, and variability too in whether voicing continues through to the end of the postrelease aspirated portion of the sound. Schiefer observes a tendency in all subjects to keep constant the overall duration of the voicing (prevoicing, burst, and postvoicing) in the voiced aspirates, while the point of onset of the voicing before the stop burst might vary. This trading relation between prevoicing and postvoicing plays a role in perception as well: the overall length of the voiced period, independent of the precise timing of its realization with respect to the burst, appears to be the relevant factor in the perception of voiced aspirate stops. Schiefer proposes, therefore, that prevoicing and postvoicing are part of the same gesture, in the Browman and Goldstein sense. Thus, while both the voiced aspirated and plain voiced (unaspirated) stops have in common the presence of this voicing gesture, named by the feature "lead onset time," they are distinguished in the details of the timing, or phasing, of the gesture with respect to the release. Ladefoged (1971: 13) took issue with the naive theory of Hindi stops: "when one uses a term such as voiced aspirated, one is using neither the term voiced nor the term aspirated in the same way as in the descriptions of the other stops." For Ladefoged "murmured [voiced aspirated] stops represent a third possible state of the vocal cords." Schiefer's study can be taken as a refutation of Ladefoged, confirming the naive theory's assumption that just two features, perfectly cross-classifying, characterize the Hindi stops. For Schiefer the "different mode of vibration" Ladefoged claims for the voiced aspirate stops would simply be the consequence of the different possibilities of phasing of the voicing gesture in the voiced aspirates. The different phasing results in (a) shorter or nonexistent prevoicing times (thereby creating a phonetic "contrast" with plain voiced stops), and (b) the penetration of voicing into the postrelease aspirated period (thereby creating a phonetic "contrast" with the aspiration of voiceless unaspirated stops). These phonetic "contrasts" are a matter of detail in the phonetic implementation of the voicing gesture, and do not motivate postulating yet another feature to represent the distinctions among these sounds. The chart in (2) gives the Schiefer theory of the featural classification of Hindi stops: (2)
    Schiefer's theory of the Hindi stops
                                   /p/    /ph/    /b/    /bh/
    "Onset of regular voicing"      0      2       0      2
    "Lead onset time"               0      0       2      1
The different values for "lead onset time" given by Schiefer to /b/ and /bh/ - "2" and "1," respectively - are justified by her on the basis of (a) the phasing difference in the realization of the voicing gesture in the two cases, and (b) the fact that voicing lead is less important in the perception of voiced aspirates than it is with plain voiced stops. Schiefer seems to imply that the representation in (2) is appropriate as a phonological representation of the contrasts in the Hindi stop series. This is arguably a mistake. I would like to suggest that (2) should be construed as the representation of the phonetic realization of the contrasts in the Hindi stop series, and that the chart in (3), a revision of the naive theory (1) in accordance with Schiefer's characterization of the features "aspiration" and "voice", provides the appropriate phonological representation of the contrasts:
(3) An alternative theory of the Hindi stop series
                                   /p/    /ph/    /b/    /bh/
    "Onset of regular voicing"      -      +       -      +
    "Lead onset time"               -      -       +      +
My proposal is that the phonological representation (3) is phonetically implemented as something like (2). The representation in (2) presupposes that the features involved are n-ary valued. This n-ariness may make sense in a phonetic representation. In the phonetic implementation of "lead onset time," for example, it is indeed necessary to specify the difference in phasing of the feature in the two different types of voiced stops. This could be represented by means of an n-ary scale. But there are good reasons why (2) does not hold up as a representation of phonemic contrasts. Phonological considerations give no basis for making anything more than a binary distinction in the voicing dimension: either the voicing gesture is there, or it is not. Consider first the fact that in Hindi the difference in the phasing of the voicing gesture is not itself contrastive, i.e. plays no role in the phonology. Indeed, which value - 1 or 2 - a voiced sound bears for the feature lead onset time is predictable. It is entirely a function of what the specification of the sound is for the feature for aspiration. Given the assumption that a phonological representation contains only feature specifications that are contrastive, i.e. only ones which allow for phonemic contrasts in the language, the 1/2 distinction in "lead onset time" cannot be phonological. In other words, "1" and "2" are simply allophonic variants of a same underlying feature specification, which is a positive specification for the presence of the glottal voicing gesture. Banning the "1" vs. "2" specification for the voicing feature from underlying phonological representation appears to be supported by more
general phonological considerations as well. The two candidates for the phonological representation of Hindi stops, (2) and (3), presuppose two different general theories of the features for "voicing" and "aspiration." The two different theories make radically different predictions about what sorts of stop systems might exist among the world's languages. The naive theory (revised) makes the claim that no more than two phonemic distinctions are ever made on the basis of voice (or lead onset time), namely "voiced"/"voiceless," and that no more than two phonemic distinctions can be made on the basis of aspiration (or delay in onset of regular voicing), namely "aspirated"/"unaspirated." Thus, according to the theory presupposed by (3), Hindi and the other Indo-Aryan languages exhaust the contrastive possibilities offered by the two features. The n-ary theory that is presupposed if (2) is understood as a phonological representation makes no such restrictive claims. As long as the specifications 0, 1, 2, etc. are taken to be phonological, it is predicted that languages could display a far greater range of phonemic distinctions based on "lead onset time" and "delay in regular voicing" than is seen in Hindi. This prediction seems not to be borne out. A system like Hindi's is rare enough, and it is unlikely that languages will be found that exploit a further array of distinctions based on these phonetic dimensions alone. It is probably not premature to rule out the n-ary theory of these two features on the grounds that it fails to make sufficiently restrictive predictions about possible sound systems in language. The ability to characterize narrowly the possible systems of contrast across the world's languages provides one criterion for evaluating alternative phonological feature theories. Another criterion for a feature theory is that it be able to capture generalizations about the phonological processes found in languages. If (3) is the correct representation of the voiced aspirates in Hindi and other languages, then there are two sorts of predictions made about how these stops should "behave" in the sound patterning of the language. First, it is predicted that voiced aspirates and plain voiced sounds should behave as a natural class with respect to any rule manipulating simply the feature voice, and it is predicted that voiced and voiceless aspirates should behave as a natural class with respect to any rule manipulating aspiration. Hindi appears not to exhibit any such voice- or aspiration-manipulating processes, and so fails to provide relevant evidence with respect to this prediction (see M. Ohala 1983). But the closely related Marathi has a word-final deaspiration rule, and it affects both the voiceless and voiced aspirates, leaving them as plain voiceless and voiced stops, respectively (Houlihan and Iverson 1979; Alan Prince, p.c. - both reports are based on fieldwork, not on instrumental studies). According to the revised naive theory, the fourway phonemic contrast in Marathi would have the phonological representation (3). The two
types of sounds deaspirated are identified by their common feature specification [+ (delayed) onset of regular voicing], and that specification is eliminated (or changed to "-") in the operation of the deaspiration rule. The claim made by the theory in (3) is that such a deaspiration rule will necessarily treat voiced and voiceless aspirates as a natural class regardless of the details of the phonetic realization of the aspiration gesture in the language. There is a second difference in the predictions about phonological processes made by the n-ary and binary theories of voicing in Hindi and other Indo-Aryan languages. The theory in (3) predicts that if voiced aspirates are deaspirated in some language, they will always be realized as phonetically identical to plain voiced sounds, regardless of whether or not in that language the voicing feature has a different phonetic realization in the two types of voiced sounds. No such prediction is made if a representation like (2) is taken as the phonological representation. Indeed, according to this latter theory, if such a deaspiration were to take place in a language with phonetic properties similar to Hindi's, it would be predicted the deaspirated /bh/ would be realized by a [b1], distinct from the [b2] realizing the plain voiced one /b/. (The superscripts 1 and 2 correspond to the putative phonological feature values for "lead onset time" that would be predicted to remain unmodified by the deaspiration.) Sanskrit has a rule which deaspirates aspirated segments in reduplicated prefixes. The deaspirated voiced sounds behave just like underlying plain voiced sounds with respect to further rules of the phonology, in particular the interword rules of external sandhi. This is predicted by the naive theory. Indeed, the naive theory predicts there should always be such an absolute neutralization under deaspiration. The n-ary theory makes no such prediction. Rather, it predicts that the deaspirated voiced aspirated could display phonological and phonetic behavior distinct from that of the plain voiced stop. Cross-linguistic phonological evidence of the sort outlined above has yet to be systematically accumulated, and so it is not at this point possible to say for sure whether it is the n-ary or the binary theory of the features for "voice" and "aspiration" which makes the right predictions. What is important is that it is clear what sorts of evidence would be relevant to deciding the case. And the evidence comes from the phonological treatment of voiced aspirates in the various languages, not from the phonetics laboratory. Phonological feature theory has always embraced the notion that individual phonetic features are grounded in phonetic reality, and has looked to phonetics for confirmation of hypotheses about the nature of particular phonological features. Schiefer's study forms part of this phonetics-phonology give and take. Her results on the nature of voiced aspirates provide the phonetic basis for assuming, as the naive phonological theory in (1) has done,
that there are just two features involved in characterizing the fourway contrast /p, ph, b, bh/ in Hindi and the other Indo-Aryan languages. But phonological feature theory has not looked to phonetics for an answer to the question of whether features are monovalent, binary, or n-ary. This is a question that cannot be answered without examining the workings of phonological systems, along the lines that have been suggested above. The phonetic dimensions corresponding to phonological features are typically gradient, not quantal. It is in phonology that one finds evidence of the manner in which these dimensions are quantized, and so it is phonology that must tell us how to assign values to phonological features.
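The difference between the two theories can be made concrete with a small sketch. The feature names and the numerical values below are illustrative stand-ins for the representations in (2) and (3), not the actual content of those representations, and the deaspiration rule is only a schematic rendering of the Marathi and Sanskrit facts described above.

```python
# Illustrative sketch only: feature names and values are stand-ins for the
# binary representation in (3) and the n-ary representation in (2); they are
# not taken from the original chapter.

BINARY = {                       # revised naive (binary) theory, cf. (3)
    "p":  {"voice": "-", "delayed_onset": "-"},
    "ph": {"voice": "-", "delayed_onset": "+"},
    "b":  {"voice": "+", "delayed_onset": "-"},
    "bh": {"voice": "+", "delayed_onset": "+"},
}

NARY = {                         # n-ary theory, cf. (2); numbers are hypothetical
    "p":  {"lead_onset_time": 0, "voicing_delay": 0},
    "ph": {"lead_onset_time": 0, "voicing_delay": 2},
    "b":  {"lead_onset_time": 1, "voicing_delay": 0},
    "bh": {"lead_onset_time": 2, "voicing_delay": 2},
}

def deaspirate_binary(segment):
    """Word-final deaspiration: [+delayed_onset] becomes [-delayed_onset]."""
    out = dict(segment)
    out["delayed_onset"] = "-"
    return out

def deaspirate_nary(segment):
    """The same rule in the n-ary theory: only the delay value is removed."""
    out = dict(segment)
    out["voicing_delay"] = 0
    return out

# Binary theory: deaspirated /bh/ and /ph/ are predicted to be identical to /b/ and /p/.
assert deaspirate_binary(BINARY["bh"]) == BINARY["b"]
assert deaspirate_binary(BINARY["ph"]) == BINARY["p"]

# N-ary theory: the deaspirated /bh/ keeps its own lead value, so nothing forces
# it to neutralize completely with plain /b/.
print(deaspirate_nary(NARY["bh"]))   # {'lead_onset_time': 2, 'voicing_delay': 0}
print(NARY["b"])                     # {'lead_onset_time': 1, 'voicing_delay': 0}
```

The point of the sketch is only that the binary rule guarantees absolute neutralization under deaspiration, while the n-ary representation leaves room for a phonetically distinct deaspirated voiced stop.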
Section C Prosody
13 An introduction to intonational phonology D. ROBERT LADD
13.1 Introduction The assumption of phonological structure is so deeply embedded in instrumental phonetics that it is easy to overlook the ways in which it directs our investigations. Imagine a study of "acoustic cues to negation" in which it is concluded, by a comparison of spectrographic analyses of negative and corresponding affirmative utterances, that the occurrence of nasalized formants shows a statistically significant association with the expression of negation in many European languages. It is quite conceivable that such data could be extracted from an instrumental study, but it is most unlikely that anyone's interpretation of such a study would resemble the summary statement just given. Nasalized formants are acoustic cues to nasal segments-such as those that happen to occur in negative words like not, nothing, never (or non, niente, mai or n'e, n'ikto, n'ikogda, etc.)-rather than direct signals of meanings like "negation." The relevance of a phonological level of description - an abstraction that mediates between meaningful units and acoustic/articulatory parameters - is taken for granted in any useful interpretation of instrumental findings. Until recently, the same sort of abstraction has been all but absent from instrumental work on intonation. Studies directly analogous to the hypothetical example have dominated the experimental literature, and the expression "intonational phonology" is likely to strike many readers as perverse or contradictory. In the last fifteen years or so, however, a body of theoretical work has developed, empirically grounded (in however preliminary a fashion) on instrumental data, that gives a clear meaning to this term. That is the work reviewed here.1 1
The seminal work in the tradition reviewed here is Pierrehumbert (1980), though it has important antecedents in Liberman (1975) and especially Bruce (1977). Relevant work since 1980 includes Ladd (1983a), Liberman and Pierrehumbert (1984), Gussenhoven (1984), Pierrehumbert and Beckman (1988), and Ladd (1990).
13.2 Linear structure
The most important tenet of the new phonological descriptions is that fundamental frequency (Fo) is best understood as a sequence of discrete phonological events, rather than as a continuously varying contour characterizable by overall shape and direction. Obviously, in some very direct phonetic sense Fo "is" a continuously varying contour (although even that statement abstracts away from gaps associated with voicelessness) - but at that level of abstraction the same is true of, say, the second formant. What is required for both Fo and F2 - is a further abstraction, in which a phonological string (of "segments," "tones," etc.) can be mapped onto a sequence of phonetic targets. The continuously varying contour emerges when the phonetic targets are joined up by "low-level" transitions.2 13.2.1 Basic aspects of intonational structure 13.2.1.1 Pitch accents and boundary tones In most European languages (and many others as well) the most important of the discrete events that make up the pitch contour are pitch accents. These are characteristic Fo features that accompany prominent syllables - peaks, valleys, falls, rises, etc. This use of the term "pitch accent" is due to Bolinger (1958), though in comparison to Bolinger's original usage the current sense shows certain differences of emphasis to which I will return in section 13.4.1. (Bolinger's concept - though not his term - corresponds more closely to the "prominence lending pitch movements" in the system of 't Hart and his colleagues, e.g. Cohen and 't Hart [1967], 't Hart and Collier [1975], Collier and 't Hart [1981].) Figure 13.1 shows an ordinary declarative utterance of English with two pitch accents. Besides pitch accents, the other main phonological elements that make up Fo contours are boundary phenomena of various sorts, at least some of which are generally known as boundary tones (see 13.2.2.2 below). The clearest cases are abrupt final rises taking place within the last 300-500 msec, of a phrase or utterance, generally analyzed as "high boundary tone." There are also sometimes distinctive effects at initial boundaries, as in the RP pronunciation of The bathroom?! shown in figure 13.2. This has a high initial boundary tone (i.e. a distinctively high starting pitch not associated with an accented syllable), followed by a low or low-rising pitch accent, followed by a final boundary rise. This is not to suggest that a target-and-transition model is necessarily the ideal phonetic model of either Fo or spectral properties, but only that, as a first approximation, it is equally well suited to both. 322
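The target-and-transition idea can be sketched in a few lines of code. The target times and frequency values below are invented for illustration, and linear interpolation is only the simplest possible transition shape; nothing in the phonological claim depends on these particular choices.

```python
# A minimal sketch of targets joined by transitions. The target times and
# values are invented; linear interpolation is just the simplest transition.

targets = [          # (time in seconds, F0 in Hz) for a two-accent declarative
    (0.15, 200),     # hypothetical H* peak on the first accented syllable
    (0.60, 170),     # hypothetical H* peak on the second, somewhat lower
    (0.85, 120),     # low utterance-final endpoint
]

def f0_at(t, targets):
    """Interpolate linearly between targets; hold the first/last value outside them."""
    if t <= targets[0][0]:
        return targets[0][1]
    if t >= targets[-1][0]:
        return targets[-1][1]
    for (t1, f1), (t2, f2) in zip(targets, targets[1:]):
        if t1 <= t <= t2:
            return f1 + (f2 - f1) * (t - t1) / (t2 - t1)

# A continuously varying contour emerges from three discrete events.
contour = [(round(i / 20, 2), round(f0_at(i / 20, targets))) for i in range(18)]
print(contour)
```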
Figure 13.1 Speech wave and Fo contour for the utterance Her mother's a lawyer, spoken with an unemphatic declarative intonation. The peaks of the two accents (arrows) are aligned near the end of the stressed syllables mo- and law-
Figure 13.2 Speech wave and Fo contour for the sentence The bathroom?! (see text for detail). The valley of the low(-rising) pitch accent is aligned near the end of the syllable bath-
13.2.1.2 Association of tunes to texts
The basic phonological analysis of a pitch contour is thus a string of one or more pitch accents together with relevant boundary tones. Treating this description as an abstract formula, we can then speak of contour types or tunes, and understand how the same contour type is applied to utterances with different numbers of syllables. For example, consider the tune that can be used in English for a mildly challenging or contradicting echo question, as in the following exchange.3
(1) A: I hear Sue got a fellowship to study physics.
    B: Sue?
On the monosyllabic utterance Sue this contour rises and falls and rises again. However, we are not dealing with a global rise-fall-rise shape that applies to whole utterances or to individual syllables, as can be seen when we apply the same contour to a longer utterance:
(2) A: I hear Sue's taking a course to become a computer programmer.
    B: A computer programmer?
The rise-fall-rise shape that spanned the entire (one-syllable) utterance in Sue? is not simply stretched out over the seven-syllable utterance here; nor is it applied to the accented syllable -pu- alone. Instead, the contour is seen to consist of at least two discrete elements, a pitch accent that rises through the accented syllable and then falls, and a boundary rise that is confined to the last few hundred msec. of the utterance. The Fo on the syllables -ter program- is simply a transitional stretch between the low level reached at the end of the pitch accent and the beginning of the final rise. Given an appropriate utterance, such a transitional stretch could be extended even further.
As the foregoing example makes clear, one of the key assumptions of current intonational phonology is that it is possible for syllables - sometimes several consecutive syllables - to be phonologically unspecified for Fo. The validity of this assumption is perhaps most clearly demonstrated by Pierrehumbert and Beckman in their work on Japanese (1988: esp. ch. 4). They show that the traditional analysis - in which every mora is distinctively associated with either a H or L tone - makes phonetic predictions that are falsified by their data. Their empirical findings are modeled much more successfully if we assume that tones are associated only with certain points in the segmental string.
In order to appreciate the force of examples (1) and (2) it is important to get the intonation right on B's reply. In particular, one contour that is not intended here is a more or less steadily rising one, which conveys surprise or merely a request for confirmation.
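The way a single tune is distributed over texts of different lengths can also be sketched schematically. The function below simply anchors an accentual event to the accented syllable and a boundary event to the final syllable; the syllabifications and the event labels are simplifications introduced here for illustration, not a worked-out model of English.

```python
# Sketch of tune-text association: the same tune anchored to a one-syllable
# and a seven-syllable utterance.

def anchor_tune(syllables, accented_index):
    """Return (syllable, tonal events) pairs; syllables with no event are unspecified."""
    events = [[] for _ in syllables]
    events[accented_index].append("rising pitch accent")
    events[-1].append("final boundary rise")
    return list(zip(syllables, events))

print(anchor_tune(["Sue"], 0))
# [('Sue', ['rising pitch accent', 'final boundary rise'])]

print(anchor_tune(["a", "com", "pu", "ter", "pro", "gram", "mer"], 2))
# -pu- carries the accent, -mer the boundary rise, and the syllables in between
# carry nothing: their F0 is a transition between the two anchored events.
```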
13.2.2 Pitch accents as sequences of tones
13.2.2.1 The two-level theory
The outline of intonational structure just sketched has many aspects that go back to earlier work, such as the British nuclear tone descriptions (well summarized in Crystal 1969), the American levels analyses of Pike (1945) and Trager and Smith (1951), and especially Bolinger's notion of pitch accent (Bolinger 1958, 1986). The most important innovation of the work under review here is that pitch accents, and in some cases perhaps boundary tones, are further analyzed as sequences or combinations of high and low tones. This approach is based in part on the tonal phonology of many African languages, in which it is well established that two "level" lexical tones (such as H and L) can occur on the same syllable and yield a phonetically falling or rising contour. As applied to languages like English, the decomposition of pitch accents into tones has been a point of considerable contention. The dichotomy drawn by Bolinger (1951) between intonational analyses based on "levels" and those based on "configurations" has sometimes been invoked (e.g. 't Hart 1981) as a basis for not analyzing pitch accents in this way. However, as I have argued elsewhere (Ladd 1983b), Bolinger's theoretical objections really apply only to the American "levels" analyses just mentioned. The approach under consideration here effectively solves Bolinger's levels-vs.configurations issue by reducing the number of distinct levels in languages like English or Dutch to two, and by defining them in such a way that their phonetic realization can vary quite considerably from one pitch accent to another. That is, there is no presumption, as there was in the traditional levels analyses, that a given phonological abstraction like H will necessarily correspond to a certain Fo level; the mapping from phonological tones to Fo targets is taken to be rather complex. The two-level theory, first formulated explicitly by Pierrehumbert (1980), constitutes the central theoretical innovation on which the current approaches to intonational phonology are based. Theoretical issues aside, the laboratory evidence for target levels in Fo is strong. In perhaps the clearest result of this sort, Bruce (1977) found that the most reliable acoustic correlate of word accent in Swedish is a peak in the Fo contour aligned very precisely in time with respect to the accented syllable. The rise preceding the peak, and/or the fall that follows it, can be suppressed or reduced under certain circumstances; the important thing for signaling the accent is, in Bruce's words, "reaching a certain pitch level at a particular point in time ..., not the movement (rise or fall) itself (1977: 132). In another experiment, Liberman and Pierrehumbert (1984) had speakers utter specific contour types with wide variations of overall speaking range. They 325
found that the phonetic property of the contour that remained most invariant under range variation-i.e. the thing that most reliably characterized the contour types-was the relationship in Fo level between the two accent peaks of the contours; other measures (e.g. size of pitch excursion) were substantially more variable. Finally, it has been shown repeatedly that the endpoints of utterance-final Fo falls are quite constant for a given speaker in a given situation (e.g. Maeda 1976; Menn and Boyce 1982; Liberman and Pierrehumbert 1984; Ladd 1988 for English; Ladd et al 1985 for German; van den Berg, Gussenhoven, and Rietveld [this volume] for Dutch; Connell and Ladd 1990 for Yoruba). It has been suggested that this constant endpoint is (or at least, reflects) some sort of "baseline" or reference value for the speaker's Fo range.4 13.2.2.2 Some remarks on notation Pierrehumbert (1980) proposed a notational system for intonation that expresses the theoretical ideas outlined in the foregoing sections, and her system has been adopted, with a variety of modifications, by many investigators. The basic points of this system, with a few of the modifications that have been suggested, are outlined in this section. Discussion of the theoretical issues underlying the differing versions of the notation is necessarily extremely condensed in what follows. Pitch accents contain at least one tone, either H or L, which is associated with the accented syllable. In addition, they may contain a preceding or following tone, for example in cases where the pitch accent is characterized by rapid Fo movement rather than just a peak or a valley. In Pierrehumbert's original system, the tone associated with the accented syllable is written with an asterisk (H* or L*), and if there is a preceding or following tone in the pitch accent it is written with a following raised hyphen (H~ or L~); in a bitonal pitch accent the two tones are joined with a + (e.g. L* + H). In some systems based on Pierrehumbert (e.g. the one used by van den Berg, Gussenhoven, and Rietveld, in this volume), both the plus and the raised hyphen are dispensed with, and one would write simply L*H. In any case, it is convenient to distinguish "starred tones" (T*) from "unstarred tones" (T~ or just T), as the two types may exhibit certain differences in their phonological and phonetic behavior. Boundary tones, associated with the edges of intonational phrases, are 4
It has also been suggested (e.g. Pierrehumbert and Beckman, this volume) that the invariance of contour endpoints has been exaggerated and/or is an artifact of Fo extraction methods, and that the scaling of contour endpoints can be manipulated to signal discourse organization. Whether or not this is the case, it does not affect the claim that target levels are linguistically significant - in fact, if Pierrehumbert and Beckman are right it would in some sense strengthen the argument. 326
written with the diacritic % (H% or L%). Between the last pitch accent and the boundary tone of any phrase, in Pierrehumbert's original analysis, there is another tone she called the "phrase accent." Since this tone does not seem to be associated with a specific syllable but rather trails the pitch accent by a certain interval of time, Pierrehumbert considered this to be an unstarred tone like those that can be part of pitch accents, and therefore wrote it T . For example, the rising-falling-rising tune illustrated earlier would be written L* + H~ L~ H%, with a low-rising pitch accent, a low phrase accent, and a high final boundary tone, as in (3): (3)
    L* + H~    L~     H%
    a computer programmer
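A convenient way to see what the notation encodes is to sort the symbols of a transcription by their diacritics. The little classifier below looks only at the asterisk, a trailing hyphen (standing in here for the raised hyphen of the original notation), and the percent sign; it is an illustration of the conventions just described, not a complete parser for any published system.

```python
# Sorting the symbols of a Pierrehumbert-style transcription by diacritic.
# This is only an illustration of the notational conventions described above.

def classify(symbol):
    if symbol.endswith("%"):
        return "boundary tone"
    if "*" in symbol:
        return "pitch accent (contains a starred tone)"
    return "unstarred tone (e.g. a phrase accent)"

tune = ["L*+H-", "L-", "H%"]        # the rising-falling-rising tune of (3)
for symbol in tune:
    print(symbol, "->", classify(symbol))
```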
However, the status of phrase accents has remained a matter of some controversy (for discussion see Ladd 1983a). In what seems to be a promising approach to resolving these difficulties, Beckman and Pierrehumbert (1986) have proposed that the "phrase accent" is actually the boundary tone for an intonational domain smaller than the intonational phrase, a domain they call the "intermediate phrase." That is, the end of an intermediate phrase is marked only by what Pierrehumbert called a phrase accent, whereas the end of an intonational phrase is marked by both a "phrase accent" and a full-fledged "boundary tone." Hayes and Lahiri (1991) have adopted this analysis in their description of Bengali, and use it to motivate a useful notational innovation. For the boundary tone of an intonational phrase (Pierrehumbert's T%) they write Ti; for the boundary tone of an intermediate phrase - which they call "phonological phrase" in line with other work on prosodic structure - they write Tp. Accent tones continue to be written as starred (or not as the case may be), so that Hayes and Lahiri's notation clearly distinguishes accent tones from boundary tones. The rising-falling-rising tune just illustrated would thus, in Hayes and Lahiri's notation, be written something like L*H Lp Hi.
13.2.3 Intonation and lexical tone
One important consequence of the point of view just outlined is that it puts the relationship between "tone" and "intonation" in a different light. A fairly traditional view is that all languages have "intonation" - global Fo shapes and trends - and in addition some languages have local Fo perturbations for "word accent" or "tone" overlaid on the global intonation (see e.g. Lieberman 1967: 101f.). More generally, the traditional view is that Fo features have
extent over some domain (syllable, word, phrase, utterance), and that Fo contours are built up by superposing smaller-domain features on largerdomain ones, in a manner reminiscent of Fourier description of complex waveforms. The intonational models of e.g. O'Shaughnessy and Allen (1983) or Gronnum (this volume; see also Thorsen 1980a, 1985) are based on this approach. This view was convincingly challenged by Bruce (1977). Bruce showed that in Swedish, the phonetic manifestations of at least certain phrase-level intonational features are discrete events that can be localized in the Fo contour (Bruce's sentence accents). That is, the relationship between lexically specified Fo features (the Swedish word accents) and intonationally specified ones is not necessarily a matter of superposing local shapes on global ones, but involves a simple succession of Fo events in time. In the most restrictive versions of current intonational phonology, it is explicitly assumed that independently chosen global shapes-e.g. a "declination component" - are not needed anywhere in the phonological description. (This is discussed further in 13.3.2 below.) In effect, the restrictive linear view says that all languages have tonal strings; the main difference between languages with and without lexical tone is simply a matter of where the tonal specifications come from. In some languages ("intonation languages") the elements of the tonal string are chosen, as it were in their own right, to convey pragmatic meanings, while in others ("tone languages") the phonological form of morphemes often or always includes some tonal element, so that the tonal string in any given utterance is largely a consequence of the choice of lexical items. In this view, the only tonal elements that are free to serve pragmatic functions in a tone language are boundary tones, i.e. additional tonal specifications added on to the lexically determined tonal string at the edge of a phrase or utterance. This is how the theory outlined here would account for the common observation that the final syllable of a phrase or utterance can have its lexical tone "modified" for intonational reasons (see e.g. Chang 1958). Additionally, some tone languages also seem to modify pitch range, either globally or locally, to express pragmatic meanings like "interrogation." The possibilities open to lexical tone languages for the tonal expression of pragmatic meanings are extensively discussed by Lindsey (1985). The principal phonetic difference between tone languages and intonation languages is simply a further consequence of the functional difference. More "happens" in Fo contours in a tone language, because the tonal specifications occur nearly every syllable and the transitions span only milliseconds, whereas in a language like English most of the tonal specifications occur only on prominent words and the transitions may span several syllables. But 328
the specifications are the same kind of phonological entity regardless of their function, and transitions are the same kind of phonetic phenomenon irrespective of their length. There is no basis for assuming that lexical tone involves a fundamentally different layer in the analysis of Fo contours. 13.3 Phonetic models of F o
In order to generate Fo values from an abstract string of tonal events, some sort of phonetic model is required; in order to formulate a useful phonetic model, some account must be taken of the central fact that there are conspicuous individual differences of Fo level and range. This is obviously a
problem for a phonological description based on tones that are mapped on to Fo targets. Indeed, the ability to abstract away from individual differences of Fo level and range is one of the principal attractions of any intonational description based on configurations or contour shapes. A rise is a rise, whether it moves from 80 Hz to 120 Hz or from 150 Hz to 300 Hz. A "rate" of declination, say 1.2 semitones/sec, is a quantitative abstraction that could in principle be applied to any speaker. Nevertheless, we know that languages exist in which tonal phonology is based on pitch level, not contour shape. Moreover, as noted earlier, there is growing evidence of regularities statable in terms of pitch level even in languages without lexical tone. It seems appropriate, therefore, to devise a phonetic model of Fo that will allow us to express such regularities and at the same time to abstract away from individual differences of level and range. 13.3.1 Baseline and tonal space
Intonational phonologies like those outlined in the previous section have by and large built their phonetic models around the notion of a speaker-specific tonal space, a band of Fo values relative to which the tonal events are scaled.5 This tonal space is somewhat above the speaker-specific baseline, a theoretical bottom of the range which in speaking is normally reached - if at all-only at the end of utterance-final falls. For example, a pitch accent analyzed phonologically as a sequence of H tone and L tone might be "Tonal space" is an ad hoc term. In various contexts the same general abstraction has been referred to as tone-level frame (Clements 1979), grid (Garding and her co-workers, e.g. Garding 1983), transform space (Pierrehumbert and Beckman [1988], but note that this is not really intended as a technical term), and register (e.g. Connell and Ladd 1990). The lack of any accepted term for this concept is indicative of uncertainty about whether it is really a construct in its own right or simply a consequence of the interaction of various model parameters; only for Garding does the "grid" clearly have a life of its own. See further section 13.3.2.
Figure 13.3 Idealization of the two-accent Fo contour from figure 13.1, illustrating one way in which the accents might be modeled using the concept of "tonal space" [diagram not reproduced; its labels are "register shift," "tonal space," and "baseline"]
modeled phonetically as a fall from the top of the tonal space to the bottom, ending some distance above the baseline. This is shown in Figure 13.3. The mathematical details vary rather considerably from one model to another, but this basic approach can be seen in the models of e.g. Bruce (1977), Pierrehumbert (1980; Liberman and Pierrehumbert 1984; Pierrehumbert and Beckman 1988), Clements (1979, 1990), Ladd (1990; Ladd et al. 1985; Connell and Ladd 1990), and van den Berg, Gussenhoven, and Rietveld (this volume); to a considerable extent it is also part of the models used by 't Hart and his colleagues and by Garding and her colleagues (see note 5). Mathematical details aside, the biggest point of difference among these various models lies in the way they deal with differences of what is loosely called "pitch range." Pretheoretically, it can readily be observed that some speakers have wide ranges and some have narrow ones; that pitch range at the beginning of paragraphs is generally wide and then gets narrower; that pitch range is widened for emphasis or interest and narrowed when the topic is familiar or the speaker is bored or depressed. But we lack the data to decide how these phenomena should be expressed in terms of the parameters of phonetic models like those described here. For example, many instrumental studies (e.g. Williams and Stevens 1972) have demonstrated that emotional arousal - anger, surprise, etc.-is often accompanied by higher Fo level and wider Fo range. But level is generally defined in these studies in terms of mean Fo (sampling Fo every 10-30 msec, and thus giving equal weight to targets and transitions); range is usually viewed statistically in terms of the variance around the Fo mean. We do not know, in terms of a phonetic model like the ones under consideration here, whether these crude data reductions reflect a raising of the tonal space relative to the baseline, a widening of the tonal space, a raising of everything including the baseline, or any of a number of other logical possibilities. A great deal of empirical work is needed to settle 330
questions like these. In the meantime, different models have taken rather different approaches to these questions. 133.2 Downtrends Perhaps the best illustration of such differences is the treatment of overall Fo downtrends ("declination" and the like). Since the work of Pierrehumbert (1980), it is widely accepted that at least some of what has earlier been treated as global declination is in fact downstep - a stepwise lowering of high Fo targets at well defined points in the utterance. However, this leaves open a large number of fairly specific questions. Does downstep lower high targets by narrowing the tonal space or by lowering it? Is it related to the "resetting of pitch range" that often follows prosodic boundaries (see the papers by Kubozono, and van den Berg, Gussenhoven, and Rietveld, this volume)? Are there residual downtrends - true declination - that downstep cannot explain? If there are, do we model such declination as a gradual lowering of the baseline, as a gradual lowering of the tonal space relative to the baseline, or in some other way? Appropriately designed studies have made some progress towards answering these questions; for example, the coexistence of downstep and declination is rather nicely demonstrated in Pierrehumbert and Beckman's work on Japanese (1988: ch. 3). But we are a long way from understanding all the interactions involved here. A further theoretical issue regarding downtrends - and in some sense a more basic one - is whether the shape and direction of the tonal space can be chosen as an independent variable, or whether it is conditioned by other features of phonological structure and the tonal string. In the models proposed by Garding and her co-workers (e.g. Garding 1983; Touati 1987) and by 't Hart and his colleagues (e.g. Cohen and 't Hart 1967; 't Hart and Collier 1975; Cohen, Collier, and 't Hart 1982), although there is a clear notion of describing the contour as a linear string of elements, the tonal space can also be modified globally in a way that affects the scaling of the elements in the string. This recalls the global intonation component in traditional models of tone and intonation (see 13.2.3 above). In models more directly inspired by Pierrehumbert's work, on the other hand, the tonal space has no real life of its own; any changes to the tonal space are either due to paralinguistic modification of range, or are - like downstep - triggered by phonological choices in the tonal string and/or in prosodic structure. 13.4 Prosodic structure
No discussion of current work on intonational phonology would be complete without some consideration of the relevance of metrical phonology. The 331
cornerstone of this theoretical development was Liberman's work on English stress (Liberman 1975; Liberman and Prince 1977), which argued that stress or prominence crucially involves a relation between two nodes in a binary tree structure, e.g.: (4)
[binary tree diagram not reproduced: sister nodes are labeled w or s] (w = weak, s = strong)
A great deal of theoretical work on relational and hierarchical structures in phonology has followed Liberman's lead, and it is well beyond the scope of this review to trace all these developments. (The introduction to autosegmental and metrical phonology in van der Hulst and Smith [1982] remains an excellent introduction to the rapid early developments of these theories.) Here I wish to concentrate on the relevance of metrical phonology to laboratory work on two fairly specific phenomena: pitch accents and higher level phrasing.
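A brief sketch may help to show what it means for prominence to be relational. The bracketing chosen for the word below is only an illustration; the point is that the most prominent terminal is the one reached by following s branches from the root, so prominence is defined by the labeling rather than by any intrinsic property of a syllable.

```python
# Sketch of the relational view of prominence: a binary tree with sister nodes
# labeled s(trong) and w(eak). The bracketing of "intonation" is illustrative.

# A node is either a leaf (a syllable) or a tuple (label_left, left, label_right, right).
tree = ("w", ("w", "in", "s", "to"),
        "s", ("s", "na", "w", "tion"))

def most_prominent(node):
    """Follow s-labeled branches down to the most prominent terminal."""
    if isinstance(node, str):
        return node
    label_left, left, label_right, right = node
    return most_prominent(left if label_left == "s" else right)

print(most_prominent(tree))   # 'na': main prominence falls out of the w/s labeling
```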
Early experimental work on the acoustic cues to perceived stress in isolated words and short utterances (notably the classic experiments by Fry 1955, 1958) pointed to an important role for pitch movement or pitch change on the affected syllable. This in turn cast doubt on the traditional distinction between stress accent (or "dynamic accent") and pitch accent (or "melodic accent"), and led Bolinger to redefine the term pitch accent to mean a local pitch configuration that is simultaneously a cue to prominence and a building block of the intonation contour. Bolinger's concept survives more or less intact in the model of intonation put forth by 't Hart and his colleagues, whose term "prominence-lending" pitch movement clearly implies that the pitch movement is the sine qua non of prominence in utterances. Bolinger's terminology, on the other hand, has been taken over for a modified concept in the work of Pierrehumbert and many of the other authors under review here, in the sense of a building block of the intonation contour that is merely anchored to or associated with a prominent syllable. This association, and indeed the prominence itself, are frequently described in terms of metrical structure. That is, prominence is more abstract than pitch movement, and can be cued in a variety of other ways. A good deal of experimental evidence points to this conclusion. First, it 332
has been shown by both Huss (1978) and Nakatani and Schaffer (1978) that certain putative differences of stress can be directly reflected in syllable duration without any difference of pitch contour. Second, Beckman (1986) has demonstrated clear perceptual differences between Japanese and English with respect to accent cues: in Japanese accent really does seem to be signaled only by pitch change, whereas in English there are effects of duration, intensity, and vowel quality that play an important perceptual role even when pitch change is present.6 Third, there are clear cases where pitch cues are dependent on or orchestrated by a prominent syllable without being part of the syllable itself. In Welsh declarative intonation, for example, a marked pitch excursion up and down occurs on the unstressed syllable following the major stressed syllable; the stressed syllable may be (but need not be) distinguished durationally by the length of the consonant that closes it (Williams 1985). This makes it clear that the pitch movement is an intonational element whose distribution is governed by the (independent) occurrence of prominence, and suggests that pitch movement cues prominence in a rather less direct way than that imagined by Fry and other earlier investigators.
13.4.2 Higher-level phrasing
It is well established that differences of hierarchical structure or boundary strength (Cooper and Paccia-Cooper 1980) can be reflected intonationally in the height of Fo targets in the vicinity of boundaries. For example, Ladd (1988) showed that, in structures of the sort A and B but C and A but B and C (where A, B, and C are clauses with three accented syllables each), the accents at the beginning of the B and C clauses are slightly higher when they follow a but-boundary than when they follow an and-boundary. Ladd interpreted this as reflecting a difference of hierarchical organization, as in (5):
(5) A and B but C        A but B and C
    [tree diagrams not reproduced: in the first structure A and B form a constituent below the but-boundary; in the second, B and C form a constituent]
Data on the phonetics of prominence in Tamil, reported by Soundararaj (1986), suggest that Tamil is like Japanese in using only pitch to cue the location of the accented syllable; Soundararaj found no evidence of differences of duration, intensity, or vowel quality (see also the statements about Bengali in Hayes and Lahiri 1991). Interestingly, however, the function of pitch in Tamil (or Bengali) is rather more like the function of pitch in English or Dutch, i.e. purely intonational, not lexically specified as in Japanese. That is, phonetically speaking, Tamil is (in Beckman's sense) a "pitch-accent" language like Japanese, but functionally speaking it uses pitch like the "stress-accent" languages of Europe. If this is true, it provides further evidence for excluding function from consideration in modeling F o in different languages (see 13.2.3 above). 333
Related findings are reported by e.g. Cooper and Paccia-Cooper (1980) and by Thorsen (1985, 1986). The principal issue that these findings raise for intonational phonology involves the nature of the hierarchical structures and their relationship to phonetic models of Fo. Are we dealing directly with syntactic constituent structure, as assumed for example by Cooper and Paccia-Cooper? Are we dealing with some sort of discourse organization, in which boundary strength is a measure of the "newness" of a discourse topic? (This has been suggested by Beckman and Pierrehumbert [1986], following Hirschberg and Pierrehumbert [1986].) Or are we dealing with explicitly phonological or prosodic constituent structure, of the sort that has been discussed within metrical phonology by, e.g., Nespor and Vogel (1986)? This latter possibility has been suggested by Ladd, who sees "relative height" relations like
(6) [branching diagram not reproduced: a node whose two branches are labeled h and l] (h = high; l = low)
as entirely analogous to the relative prominence relations that are basic to metrical phonology. For Ladd, in other words (as for van den Berg, Gussenhoven, and Rietveld, this volume), boundary-strength phenomena are thus intimately related to downstep, which can be seen as the reflection of the same kind of relational structure. For Beckman and Pierrehumbert, on the other hand, boundary-strength phenomena are similar to paralinguistic phenomena in which the overall range is raised or lowered to signal interest, arousal, etc. These theoretical issues cannot yet be adequately resolved empirically because they are intertwined with issues of phonetic modeling. That is, in terms of the phonetic-realization models discussed in section 13.3 above, boundary-related differences of Fo scaling are all pretheoretically a matter of "pitch range" and all could be expressed in a number of ways: as local expansion of the tonal space, as local raising of the tonal space, as local raising of the baseline, as greater prominence of individual accents, etc. One's understanding of which phenomena are related to which others cannot at present be separated from one's choice of how to model the phonetic detail.
14 Downstep in Dutch: implications for a model ROB VAN DEN BERG, CARLOS GUSSENHOVEN, and TONI RIETVELD
14.0 Introduction
In this paper we attempt to identify the main parameters which must be included in a realistic implementation model for the intonation of Dutch.* The inspiration for our research was provided by Liberman and Pierrehumbert (1984) and Ladd (1987a). A central concern in those publications is the phonological representation and the phonetic implementation of descending intonation contours. The emphasis in our research so far has likewise been on these issues. More specifically, we addressed the issue of how the interruption of downstep, henceforth referred to as reset (Maeda 1974; Cooper and Sorensen 1981: 101) should be represented. Reset has been viewed as (a) an upward register shift relative to the register of the preceding accent, and (b) as a local boost which interrupts an otherwise regular downward trend. We will argue that in Dutch, reset should be modeled as a register shift, but not as an upward shift relative to the preceding accent, but as a downward one relative to a preceding phrase. Accordingly, we propose that a distinction should be made between accentual downstep (which applies to H* relative to a preceding H* inside a phrase), and phrasal downstep, which reduces the range of a phrase relative to a preceding phrase, and creates the effect of reset, because the first accent of the downstepped phrase will be higher than the last downstepped accent of the preceding phrase. We will present the results of two fitting experiments, one dealing with accentual downstep, the other with phrasal downstep. Both address the question whether the two downstep factors (accentual and phrasal) are independent of speaker (cf. men vs. women) and prominence (excursion size of the contour). In addition, the first experiment addressed the issue whether This paper has benefited from the comments made by an anonymous reviewer. 335
the accentual downstep factor depends on the number of downstepped accents in the phrase. Before reporting on these data, we give a partial characterization of the intonation of Dutch in section 14.1. This will enable us to place the phenomenon of (accentual and phrasal) downstep in a larger phonological context, and will also provide us with a more complete picture of the parameters to be included in an implementation model. This model, given in section 14.2, will be seen to amount to a reorganization of a model proposed in Ladd (1987b). Section 14.3.1 reports on the experiment on accentual downstep and discusses the data fitting procedure in detail. The experiment on phrasal downstep is presented in section 14.3.2. 14.1 Intonation in Dutch
14.1.1 Tonal structure Like English and German, Dutch is an intonation language without lexical tone. Focus distribution determines the locations of accents in the utterance, each accent being an insertion slot for one of a number of possible pitch accents. Two common contours are given in figure 14.1, on the utterance Leeuwarden wil meer mannen ("Leeuwarden needs more men"). We assume that the tone segments which these contours consist of are H*L L% in (14.1a) and L*H H% in (14.1b). As shown in the figures, the first (starred) tone segment associates with the accented first syllable, the second tone segment spreads, while the last goes to the end of the utterance. The timing of H% can be observed in (14.1b) on the final, stressless syllable of mannen, while the preceding plateau evinces the spreading of the preceding H.1 In nonfinal position, these contours appear as illustrated in figure 14.2, where they occur before a H*L L% contour on mannen. (The sentence is otherwise identical to that given in figure 14.1.) In this position, the contours can be analyzed as H*L (14.2a) and L*H (14.2b), both of which are separated from the following pitch accent by an appreciable prosodic boundary, between Leeuwarden and wil. Observe that after this boundary the pitch begins low, which leads to a pitch change at the boundary in figure 14.2b. In figure 14.2a, the pitch is low on either side of the boundary, because of the L to its left. In figure 14.3, the two pitch accents appear in a different guise again. In these realizations, there is no discontinuity after Leeuwarden of the kind we find in figure 14.2. Looking at the (a) examples across figures 14.1-14.3, we would appear to be dealing with what at some level of analysis must be 1
We assume that the presence of L% can be motivated as a trigger for "final lowering" (cf. Poser 1984; Liberman and Pierrehumbert 1984).
Figure 14.1 Contours (H*L L%) AD and (L*H H%) AD on the sentence Lee*uwarden wil meer mannen
considered the same unit, while the same goes for the (b) examples. Our claim is that the explanation for the difference between the contours in figure 14.2 and those in figure 14.3 is to be found in a difference in phrasing, rather than in a different choice of pitch accent. In a slow, explicit type of pronunciation, as illustrated by the contours in figure 14.2, the tone segments of a pitch accent are confined to the highest constituent that dominates it, without dominating another pitch accent. This constituent, which we refer to as the association domain (AD), is Leeuwarden in the case of the first pitch accent in the contours in figure 14.2, because the next higher node also dominates the following H*L. The AD for this H*L, obviously, is wil meer mannen. Unless the first syllable of the AD is the accented syllable, there will be an unaccented stretch in the AD (none in Leeuwarden, and wil meer in wil meer mannen), which we refer to as the onset of the AD. By default, onsets are lowpitched in Dutch (but they may also be high-pitched, to give greater expressiveness to the contour). Turning now to the contours in figure 14.3, we observe that the AD-boundary after Leeuwarden is lacking, which we 337
Figure 14.2 Contours (H*L)AD, (H*L L%)AD and (L*H)AD (H*L L%)AD on the sentence Lee*uwarden wil meer ma*nnen
suggest is the result of the restructuring of the two ADs to a single AD'. One obvious consequence is that wil meer is no longer an onset. More specifically, the consequence for the first pitch accent is that the spreading rule applying to the second tone segment no longer has a right-hand boundary it can refer to. What happens in such cases is that this segment associates with a point in time just before the following accented syllable. The pitch of the interaccentual stretch -warden wil meer is an interpolation between the shifted tone segment and the tone segment to its left.2 The association domain is thus an intonational domain, constructed on the basis of the independently existing (prosodic) structure of the utterance, not as a constituent of the prosodic tree itself. Restructuring of ADs is more likely as the prosodic boundary that separates them is lower in rank (see Gussenhoven, forthcoming). 2
The analysis follows that given for English in Gussenhoven (1983). Note that the introduction of a pitch accent L + H* for the second accent in the contour in figure 14.3a, as in Pierrehumbert (1980), would unnecessarily add a term to the inventory of pitch accents. Moreover, it would force one to give up the generalization that contours like those in figure 14.2 are more carefully pronounced versions of those in figure 14.3.
Figure 14.3 Contours (H*L H*L L%)AD', and (L*H H*L L%)AD', on the sentence Lee*uwarden wil meer ma*nnen
To relate the contours in figure 14.2 to those in figure 14.1, we assume that the lexical representations of the two pitch accents are H*L and L*H. Boundary Tone Assignment (see (1) below) provides the bitonal H*L and L*H with a boundary tone which is a copy of their last tone segment. We have no account for when this boundary tone segment appears, and tentatively assume it is inserted when the AD ends at an intonational phrase boundary.3 (1)
Boundary Tone Assignment: ∅ → αH / αH ___ )IP
Rule (1) should be seen as a default rule that can be preempted by other tonal processes of Dutch. These include the modifications which H*L and L*H can undergo, and the stylistic rule NARRATION, by which the starred tone in any L*H and H*L can spread, causing the second tone segment to be a boundary segment. In addition, there is a third contour H*L H%. These have been described in Gussenhoven (1988), which is a critique and reanalysis of the well-known description of the intonation of Dutch by Cohen and 't Hart (1967), 't Hart and Collier (1975), Collier and 't Hart (1981).
Figure 14.4 Tonal structure of the contours in figures 14.1a, 14.2a, and 14.3a [tree diagrams not reproduced]
In summary, we find:
    End of IP:                  H*L L%    L*H H%    (second tone segment spreads)
    End of AD, but inside IP:   H*L       L*H       (second tone segment spreads)
    Inside AD':                 H*L       L*H       (interpolation between T* and the following tone segment spans the stretch between accents)
Figure 14.4 gives schematic representations of the tonal structures of the contours in figures 14.1a, 14.2a, and 14.3a.
14.1.2 Downstepping patterns
Dutch has a number of related contours, which display the downward stepping of consecutive H*s called "downstep" in Pierrehumbert (1980).
Two of the forms that downstepped contours may take are illustrated in figure 14.5. Both are instances of the sequence of placenames Haa*rlem, Rommelda*m, Harderwij*k en Den He*lder. In figure 14.5a, the H*s after a H*L have been lowered (a lowered H* is symbolized !H*), and the H*s before a H*L have spread. In the contour in figure 14.5b, the spreading of the first and second H*s stopped short at the syllable before the accent. In fact, the spreading of these H*s may be further restrained. Maximal spreading of H* will cause the L to be squeezed between it and the next !H*, such that it is no longer clear whether it has a phonetic realization (i.e. whether it forms part of the slope down to !H*) or is in fact deleted. In our phonological representations, we will include this L. It is to be noted that the inclusion of dips before nonfinal !H* in artificially produced downstepped contours markedly improves their naturalness, which would seem to suggest that the segment is not in fact deleted. Lastly, a final !H* merges with the preceding L to produce what is in effect a L* target for the final accented syllable. We will not include this detail in (2) below. The rule for downstep, then, contains two parts: the obligatory downstepping of noninitial H*s, and the variable spreading of nonfinal H*s. The domain for downstep is the AD': observe that the L does not respect the boundary between Haarlem and Rommeldam, Rommeldam and Harderwijk, etc., as shown in the contours in figure 14.5. (2)
Downstep
a. H* → !H* / T*T   (obligatory)
b. Spread H* / LH*L   (variable)
Domain: AD'
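The effect of part (a) of the rule on a string of pitch accents can be sketched symbolically. The function below only implements the obligatory downstepping of noninitial H*; the variable spreading of (2b) is left out, and the list-of-strings representation is simply a convenience introduced here for illustration.

```python
# Symbolic sketch of rule (2a): within an AD', every H* after the first pitch
# accent surfaces as a downstepped !H*. Spreading (2b) is ignored here.

def downstep_ad(accents):
    """accents: pitch accents inside one AD', e.g. ['H*L', 'H*L', 'H*L', 'H*L']."""
    out = []
    for i, accent in enumerate(accents):
        if i > 0 and accent.startswith("H*"):
            accent = "!" + accent            # noninitial H* is lowered
        out.append(accent)
    return out

# Haa*rlem, Rommelda*m, Harderwij*k en Den He*lder, all H*L, in a single AD':
print(downstep_ad(["H*L", "H*L", "H*L", "H*L"]))
# ['H*L', '!H*L', '!H*L', '!H*L']  (cf. figure 14.5)
```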
Following widely attested tonally triggered downstep phenomena in languages with lexical tone, Pierrehumbert (1980) and Beckman and Pierrehumbert (1986) propose that in English, too, downstep is tonally triggered. The reason why we cannot make the same assumption for Dutch is that contours like those in figure 14.3a exist. Here, no downstep has applied; yet we seem to have a sequence of H*L pitch accents, between which, moreover, no AD boundary occurs. Since the tonal specification as well as the phrasing corresponds with that of the contours in figure 14.5, we must assume that instead of being tonally triggered, downstep is morphologically triggered. That is, the rule in (2) is the phonological instantiation of the morpheme [downstep], which can be affixed to AD'-domains, as an annotation on the node of the constituent within which downstep takes place. The characterization of downstep as a morpheme has the advantage that all downstepped contours can be characterized as a natural class. That is, regardless of the extent to which H* is allowed to spread, and regardless of whether a modification like DELAY is applied to H*Ls (see Gussenhoven 1988), downstepped contours are all characterized as having undergone rule (2). This advantage over an analysis in which downstep is viewed as a 341
Figure 14.5 Downstepped contours H*L !H*L !H*L !H*L on the sentence Haa*rlem, Rommelda*m, Harderwij*k en Den He*lder
phonetic implementation rule, triggered by particular configurations of tone segments, was claimed by Ladd (1983a) for his proposal to describe downstep with the help of a phonological feature [±DOWNSTEP]. The explanatory power of a feature analysis appears to be low, as recognized in Ladd (1990). Beckman and Pierrehumbert (1986) point out that the feature analysis leaves unexplained why the distribution of [+downstep] is restricted to noninitial H (i.e. why not !L, or why not !H initially?). Secondly, the nonlocal effect of downstep on the scaling of all following tone segments inside the AD' would need a separate explanation in a feature analysis. The postulation of a domain-based rule like (2) accounts for these properties in the same way as does a rule of downstep which is conceived of as a phonetic implementation rule. Moreover, it avoids what we consider to be two drawbacks of the approach taken by Beckman and Pierrehumbert (1986). First, the domain-dependency of downstep, the fact that it "chains" within the AD, needs a separate explanation. Second, the phonetic-implementation analysis requires that all mid tones are grouped as a natural class. The second
mid pitch of the vocative chant, for example, a contour with a very different meaning from the downstepped contours illustrated in figure 14.5, is obtained by the same implementation rule that produces downstepped accents in Pierrehumbert (1980), Pierrehumbert and Beckman (1986). A disadvantage of our solution is that it is not immediately clear what a nondiacritic representation of our morpheme would look like. We will not pursue this question here. 14.1.3 Reset As in other languages, non-final AD's in which downstepping occurs, may be followed by a new AD' which begins with an accent peak which is higher than the last downstepped H* of the preceding AD'. In theory, there are a number of ways in which this phenomenon, henceforth referred to as reset, could be effected. First, reset could be local or global: that is, it could consist of a raising of the pitch of the accent after the boundary, or it could be a raising of the register for the entire following phrase, such that the pitch of all following tone segments would be affected. Reset as local boost. The idea of a local boost is tentatively entertained by Clements (1990). In this option, the downward count goes on from left to right disregarding boundaries, while an Fo boost raises the first part or the first H* in the new phrase, without affecting following H*s. Although local boosting may well be possible in certain intonational contexts, the contour in figure 14.6 shows that reset involves a register shift. It shows a sequence of two AD's, the first with four proper names and the second with five, all of them having H*L. The fifth proper name, Nelie, which starts the new AD', is higher than the fourth, Remy. The second and third accent peaks of the second AD', however, are also higher in pitch than the final accent of the preceding AD'. Clearly, the scaling of more than just the first H* of the new AD' has been affected by the Reset. Reset as register shift using accentual downstep factor. If reset does indeed involve a register shift, what is the mechanism that effects it? This issue is closely related to that of the size of the shift. Adapting a proposal for the description of tonally triggered downstep in African languages made by Clements (1981), Ladd (1990) proposes that pitch accents are the terminal nodes of a binary tree, whose branches are labeled [h-1] or [1-h], and whose constituency is derived from syntax in a way parallel to the syntaxphonology mapping in prosodic/metrical phonology. Every terminal node is a potential location for a register shift. Whether the register is shifted or not, and if so, whether it is shifted upward or downward, is determined by the Relative Height Projection Rule, reproduced in (3). 343
Figure 14.6 Contour (H*L !H*L !H*L !H*L) !(H*L !H*L !H*L !H*L !H*L) on the sentence (Merel, Nora, Leo, Remy), en (Nelie, Mary, Leendert, Mona en Lorna)
(3)
Relative Height Projection Rule: In any metrical tree or constituent, the highest terminal element (HTE) of the subconstituent dominated by l is one register step lower than the HTE of the subconstituent dominated by h, iff the l is on a right branch.
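One way to make the rule's predictions explicit is to treat it as a procedure over labeled binary trees. The encoding of trees as nested tuples and the recursion below are one possible reading of (3), introduced purely for illustration; they are not the formalization used by Ladd or adopted in this paper.

```python
# One possible procedural reading of the Relative Height Projection Rule (3).
# A tree is either a leaf (an accent name) or a tuple
# (label_left, left_subtree, label_right, right_subtree) with labels "h" and "l".

def register_steps(node, step=0, out=None):
    """Assign a register step to every terminal; higher numbers mean lower register."""
    if out is None:
        out = []
    if isinstance(node, str):
        out.append((node, step))
        return out
    label_left, left, label_right, right = node
    register_steps(left, step, out)
    # The l-labeled subconstituent's HTE is one step lower than the h-labeled
    # one's iff the l is on the right branch; otherwise there is no shift.
    register_steps(right, step + 1 if label_right == "l" else step, out)
    return out

# A right-branching three-accent constituent followed by a two-accent one,
# comparable to (4b): the second constituent begins one step below the first HTE.
tree = ("h", ("h", "A1", "l", ("h", "A2", "l", "A3")),
        "l", ("h", "B1", "l", "B2"))
print(register_steps(tree))
# [('A1', 0), ('A2', 1), ('A3', 2), ('B1', 1), ('B2', 2)]
```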
(4) [tree diagrams (a)-(c) not reproduced: binary trees whose branches are labeled h and l, with the register steps required by (3) given below the terminal nodes; a vertical bar separates the left- and right-hand constituents: (a) 0 1 | 1 2; (b) 0 1 2 | 1 2; (c) 0 1 2 3 | 1 2]
If we take (3) as an exhaustive description, the convention predicts that in [l-h] labeled structures there is neither downstep nor reset. In consistently left-branching [h-l] labeled trees, there is downstep for the second accent only, and no reset. In consistently right-branching [h-l] labeled trees, there is a chain of downstepped accents, which can be followed by a reset for a right-hand sister terminal node. As shown in (4), where the numbers below the terminal nodes indicate the number of register steps required by (3), there is no reset after a two-accent structure (see (4a)), but after three or more accents
there is reset, whose size depends on the number of accents that precede in the left-hand constituent (see (4b, c)). We doubt whether [h-1] labeled trees provide the appropriate representation for handling downstep. For one thing, within the AD', H*Ls are downstepped regardless of the configuration of the constituents in which they appear (see also Pierrehumbert and Beckman 1988: 168). What we are concerned with at this point, however, is the claim implicit in Ladd's proposal that the size of the reset is the same as, or a multiple of, the size of downstep inside an AD'. The contour in figure 14.7a suggests they are not equal in size. In this contour, the third H*, the one beginning the second AD, does not go up from the second H* by the same distance that the second H* was lowered from the first. Neither is the third H* scaled at the same pitch as the second. It in fact falls somewhere between the first and the second, a situation which does not obviously follow from the algorithm in (3). Reset as register shift using factor independent of accentual downstep. If we
assume that reset involves a register shift whose step size is not the same as the downstep factor, the question arises whether reset involves a register shift upwards with reference to the preceding accent, or a register shift downward with reference to the preceding phrase. In the first option, a peak scaling algorithm would apply a raising factor to the Fo of the last H* of a phrase so as to scale the first H* of a following phrase. This new value would then be used for the calculation of the value for the second, downstepped, H*, and so on. This option resembles the boost option discussed above, except that the effect of the boost will make itself felt on the scaling of all subsequent tone segments. Kubozono (1988c) describes the interruption of downstep in Japanese in this way. In the second option, a register reset is obtained by means of a lowering with reference to the preceding phrase. Notice that the computation of the reset could still be argued to take place on the basis of the pitch value of an immediately preceding target, in the sense that the scaling of the new phrase takes place with reference to the preceding phrase. These two theories of downstep-independent reset make different predictions about the size of the second and following resets in an utterance. If we assume that the reset factor is independent of the number of times it has been applied in an utterance, then, in an accent-relative model, we should expect multiple resets in an utterance to be a constant fraction of the preceding H* (the last accent in the preceding phrase). In view of the fact that AD'-final !H*s gravitate towards the same low target, the prediction is that chains of resets should not very much decrease in size. By contrast, a phrase-relative model would lead one to expect an unambiguously downstepping series of resets, since every new reset is calculated on the basis of the lowered value of the preceding one, just as AD'-internal downstepped accents form a descend345
Figure 14.7 Scaling domain of three ADs: (a) with phrasal downstep on the sentence (Merel, Nora, Leo), (Remy, Nelie, Mary), (en Mona en Lorna); (b) without phrasal downstep on the sentence de mooiste kleren, de duurste schoenen, de beste school... ("the finest clothes, the fanciest shoes, the best school...")
ing series. Thorsen's (1984a, b) discussion of "textual downdrift" in Danish suggests that a phrase-relative model would be the better choice for those data, and we hypothesize that the same is true for Dutch. Figure 14.7a illustrates a contour containing three ADs, of which the first two have H*L !H*L L*H (all accented items are proper names). Phrasal downstep has occurred twice, once on the second AD' and once on the third: observe the first H*s of the three AD's form a downstepping pattern. This view of reset as a wheels-within-wheels model of downstep would allow the sizes of the two downstep factors to be different. Conceivably, it could also account for Thorsen's (1980b) finding that the size of accentual downstep (or equivalently in her description, the slope of the declination; see also Ladd [1984]) is smaller for medial phrases than for initial or final ones. Medial phrases have a reduced register relative to preceding phrases, and the absolute size of the accentual downstep will therefore be smaller. And the utterance-final phrase will have a relatively steeper declination slope because of final lowering at the level of the utterance. Just as in the case of accentual downstep, we assume that phrasal
downstep is morphologically triggered. That is, sequences of ADs do not have to be downstepped. Figure 14.7b is an example of a series of nondownstepped ADs. The phonological coherence of the three ADs in contour 14.7b derives from the fact that all three AD's begin at the same pitch level, even though within each AD' accentual downstep has taken place. (Phonetically, because there are only two accents in the AD, this manifests itself as a timing of the final fall before the peak of the accented syllable.) Crucially, the scaling of the first H* determines that of the first H* of each following AD' in contour 14.7b just as much as it does in contour 14.7a. That is, we are not dealing with a sequence of "unconnected" phrases in the latter, as opposed to a "connected" series in the former. We will call the constituent over which the speaker's pitch range choice remains in force the scaling domain. Phrasal downstep, then, is a morpheme that optionally attaches to the scaling domain: if it is there, a contour like 14.7a results; if it is not, one like 14.7b.
14.1.4 Some issues concerning L* scaling
There are two senses in which Downstep could be a register shift. In one interpretation, Downstep affects the H* and the L of a series of H*L's inside an AD', but would leave the scaling of any following L*H pitch accent unaffected. In the other interpretation Downstep causes both H* and L* targets to be scaled down (Ladd 1987, 1990). In our model, accentual downstep and phrasal downstep are mathematically independent, and it could therefore incorporate the effect of either or both types of downstep on L* scaling. So far, we have not addressed the issue of L* scaling, but for the time being we take the position that the scaling of L* is only affected by phrasal downstep. In other words, the L* target does not depend on the number of (accentually) downstepped H*s preceding it within the same AD', but L*s will be lower as more downstepped AD's precede them. A further issue bearing on the scaling of L* targets is the effect of prominence. A contour can retain its identity while being realized with different degrees of prominence. More prominent accent peaks have higher peak values, ceteris paribus, than less prominent ones (Rietveld and Gussenhoven 1985). While the effect of increased prominence on H* would thus appear to be uncontroversial, the effect of increased prominence on L* is less clear. Is L* lowered (see Liberman and Pierrehumbert 1984), does it remain fixed (see Gussenhoven 1984), or is it raised (Steele 1986)? Since prominence effects on L* scaling in Dutch await investigation, we will keep this question open.
14.2 An implementation model for Dutch intonation
On the basis of the above discussion, we conclude that, minimally, a model for implementing Dutch intonation contours will need to include the following five parameters:
1 one parameter for speaker-dependent differences in general pitch range;
2 one parameter to model the effect of overall (contour-wide) prominence;
3 one to control the distance between the targets for H* and L*;
4 one to model the effect of accentual downstep within the phrase;
5 one to model the effect of phrasal downstep.
Our work on implementation was informed by Ladd (1987a), which sets out the issues in F0-implementation and proposes a sophisticated model of target scaling. His model, inspired by three sources (Pierrehumbert 1980; Garding 1983; Fujisaki and Hirose 1984), is given in (5).
(5) F0(n) = Fr * N^(f(Pn) * f(A))
F0(n) is the pitch value of the nth accent in Hz;
Fr is the reference line at the bottom of the speaker's range (lowest pitch reached);
N defines the current range (N > 1.0);
f(Pn) is the phrase function, with f(Pn) = f(Pn-1) * d^s and f(P1) = 1; this phrase function scales the register down or up;
d is the downstep factor (0 < d < 1);
s is +1 for downstep or -1 for upstep;
f(A) is the accent function, of the form W^(E*T), which scales H* and L* targets;
W dictates register width, i.e. the distance between H* and L* targets;
T represents the linguistic tone (T = +1 for H*, T = -1 for L*);
E is an emphasis factor; its normal value is 1.0; values > 1.0 result in more emphasis, i.e. the higher scaling of H*s and lower scaling of L*s.
We distinguish the actual scaling parameters Fr, N, W, d, and E from the binary choice variables T and s, which latter two are specified as +1 or -1 by the phonological representation. (In fact, s may be -2, -3, to effect double, treble, etc. upsteps: cf. (4).) Of the scaling parameters, Fr is assumed to be a speaker-constant, while N, W, and E are situation-dependent. Notice that increasing W has the same effect as increasing E.4 Figure 14.8 gives the model in pictorial form. It should be emphasized that in the representation given here N, W, and d do not represent absolute differences in F0-height. When comparing this model with our list of five parameters, we see that Fr, N, and W correspond to parameters 1, 2, and 3, respectively. Ladd's d is used
4 Parameter E is no longer included in the version of the model presented in Ladd (1990). The difference was apparently intended to be global vs. local: E can be used to scale locally defined prominence, as it is freely specifiable for each target.
Figure 14.8 Ladd's model for scaling accent targets
for both downward and upward scaling, which, in our model, is taken care of by the two downstep parameters 4 and 5. However, an important aspect of Ladd's formula is that it allows for the independent modeling of intraphrasal and interphrasal effects. Although our conception of downstep and reset differs from that of Ladd, we can retain his formula in a modified form. We include two downstep factors, which makes s superfluous. We also exclude E (see note 4). The model we propose scales all H* and L* targets within a scaling domain, i.e. the domain over which N and W remain constant. It uses the same mathematical formula to scale targets in scaling domains with and without phrasal downstep, as well as targets in AD's with and without accentual downstep.
(6) F0(m,n) = Fr * N^(f(Pm) * f(An))
F0(m,n) is the pitch value for the nth accent in the mth phrase;
Fr is the reference frequency determining general pitch range;
N defines the current range (N > 1.0);
f(Pm) = dp^(Sp*(m-1)), the phrase function, scales the phrase reference lines for the mth phrase;
dp is the phrasal downstep factor (0 < dp < 1);
Sp indicates whether phrasal downstep takes place or not; Sp = +1 if it does and Sp = 0 if it does not;
f(An) = W^T * da^(1/2 * Sa * (1+T) * (n-1)), the accent function, scales H* and L* targets;
W determines register width, the distance between H* and L* targets;
da is the downstep factor (0 < da < 1) for downstepping H* targets within the phrase;
T represents the linguistic tone (T = +1 for H*, T = -1 for L*);
Sa indicates whether accentual downstep takes place in that AD'; Sa = +1 if it does and Sa = 0 if it does not;
the inclusion of the weighting factor 1/2 in the accent function ensures that the exponentiation for da is 1 when n = 2, 2 when n = 3, and so on.
Figure 14.9 Our own model for scaling accent targets
Again we distinguish the actual scaling parameters, Fr, N, W, da, and dp, corresponding to the parameters mentioned under 1 to 5 above, respectively, from the binary choice variables, T, Sa, and Sp. The latter serve to distinguish between H* and L* targets, between AD's with and without accentual downstep, and between scaling domains with and without phrasal downstep, respectively. A pictorial representation of this model is given in figure 14.9. Again, we emphasize that N, W, da, and dp should not be taken to represent absolute F0 differences. In figure 14.9, the high reference line represents the F0 for the H* targets not subject to accentual downstep, while the low reference line represents the F0 for all L* targets. If phrasal downstep is in force (Sp = 1), we assume, at least for the time being, that its effect amounts to a downward scaling of both reference lines. If it is not (Sp = 0), both reference lines are scaled equally high for all AD's within the scaling domain. The phrase function we propose accomplishes both: f(Pm) = dp^(m-1) if phrasal downstep takes place, and f(Pm) = 1 if it does not. The scaling of the targets within the AD' is taken care of by the accent function given above. For an AD' without accentual downstep (Sa = 0), the accent function reduces to f(An) = W^T. Consequently, all H* targets (T = +1) are scaled on the phrase high-line (7) and all L* targets (T = -1) on the phrase low-line (8).
(7) Fr * N^(f(Pm) * W)
(8) Fr * N^(f(Pm) * 1/W)
If accentual downstep takes place (Sa = 1), the accent function becomes
f(An) = W^T * da^(1/2 * (1+T) * (n-1)). The pitch value for the nth H* target (T = +1) is given by (9), which scales the first H* target (n = 1) on the high-line. L* targets (T = -1) are scaled on the low-line (10). Note that the scaling of L* is not affected by accentual downstep, but that it is by phrasal downstep (see 14.1.4).
(9) Fr * N^(f(Pm) * W * da^(n-1))
(10) Fr * N^(f(Pm) * 1/W)
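As a concrete illustration, the sketch below implements the scaling formula in (6) together with its special cases (7)-(10). It is not the authors' software: the function, its name, and the example parameter values are ours, chosen only to show how the phrase function and the accent function combine.

```python
# Minimal sketch of the target-scaling model in (6); the parameter values in the
# example call are invented for illustration, not taken from the paper.

def f0_target(m, n, tone, Fr, N, W, da, dp, Sa=1, Sp=1):
    """F0 (Hz) of the nth accent in the mth phrase of a scaling domain.

    tone: +1 for an H* target, -1 for an L* target.
    Sa/Sp: 1 if accentual/phrasal downstep applies, 0 if not.
    """
    f_P = dp ** (Sp * (m - 1))                                    # phrase function
    f_A = (W ** tone) * da ** (0.5 * Sa * (1 + tone) * (n - 1))   # accent function
    return Fr * N ** (f_P * f_A)

if __name__ == "__main__":
    # Two phrases of three accents each, with both types of downstep switched on.
    params = dict(Fr=75.0, N=1.8, W=1.4, da=0.8, dp=0.9)
    for m in (1, 2):
        highs = [round(f0_target(m, n, +1, **params), 1) for n in (1, 2, 3)]
        low = round(f0_target(m, 1, -1, **params), 1)
        print(f"phrase {m}: H* targets {highs}, L* target {low}")
```

With Sa = 0 or Sp = 0 the same function reduces to the special cases in (7)-(10).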
The model assumes that the parameters Fr, da, and dp are constants, with (speaker-specific) values which are constant across utterances. In fact, the downstep factors dp and da may also be constant across utterances and speakers. (This is one of the questions to be dealt with below.) If these assumptions hold, all target values in a scaling domain are fully determined with the speaker's choice of N and W for that particular scaling domain.
14.3 Fitting experiments
The aim of the first experiment was to assess whether the accentual downstep factor da is independent of speaker, prominence, and/or the number of accentually downstepped accents within the AD', while the second experiment addressed the issue whether the phrasal downstep factor dp is independent of speaker and/or prominence. To this effect, we collected data from two male and two female speakers,5 all staff members of the University of Nijmegen, who produced the required contours with varying degrees of prominence. The contours used in the second experiment also allowed us to assess the relative values of the two downstep factors, da and dp.
14.3.1 The accentual downstep factor da
14.3.1.1 Method
To preclude consonantal perturbations and effects of intrinsic pitch, the downstepping contours (of the type H*L !H*L !H*L etc.) were produced on sentences consisting of three or more instances of the nonsense word maaMAAmaa, the longest sentence consisting of six of them. The speakers were instructed to produce the sentences with four different degrees of prominence. For each speaker, a total of ten sets of 16 sentences (4 lengths
5 Originally, we had more speakers in the first experiment. However, some speakers were not able to produce downstepping contours on reiterant speech while still retaining the impression that they were producing normal sentences. We therefore discarded their utterances.
times 4 prominence levels) were recorded in two sessions, which were one week apart. Subsequently, some utterances were judged to be inadequate, and discarded. In order to obtain an equal number of utterances for each of the four speakers, we randomly discarded as many utterances as necessary to arrive at a total of thirty-two sentences per length for each subject. The speech material was digitized with a sampling rate of 10 kHz. The resulting speech files were analyzed with the LPC procedures supplied by the LVS package (Vogten 1985) (analysis window 25 msec, prediction order 10 and pre-emphasis 0.95). For all nonfinal H* targets we measured the highest pitch reached in the accented vowel. This procedure resulted in F0 values which in our opinion form a fair representation of the height of the plateau that could visually be distinguished in the F0 contour. It is somewhat more problematic to establish an intuitively acceptable norm for measuring the final H* target. Visual inspection showed that the F0 contour sloped down from the plateau for the prefinal target to a lower plateau. The slope between them was generally situated in the first half of the final accented syllable. Since the contour did not offer clear cues as to where the last target should be measured, we arbitrarily decided to take a measurement at the vowel midpoint. Where the pitch-extraction program failed because vocal fry had set in, we measured the pitch period at the vowel midpoint and derived an F0 value from it. Because our model does not (yet) include a final-lowering factor, these final values will be excluded from the fitting experiment. However, we will use them to obtain an estimate for the speaker's general pitch range.
14.3.1.2 Data fitting
For each speaker, the data to be fitted are thirty-two series (differing in prominence) of two, three, four, and five values (measured in utterances of three, four, five, and six accents). The 448 data points for each speaker will be referred to as Mksn, where M stands for "measured," the subscript k gives the number of accent values in the series (2 ≤ k ≤ 5), s indicates the series number within the group of a particular length (1 ≤ s ≤ 32), and n gives the accent position within the series (1 ≤ n ≤ k). The F0 targets predicted by the model will be referred to as Pksn, where P stands for "predicted." The prediction equation for Pksn is derived as follows. With only one phrase the phrase function f(P) equals 1. Accentual downstep takes place, so Sa = 1. Because we are dealing with H* targets only (T = +1), the distinction between N and W need not be retained: the two can be considered a single range parameter (R) with R = N^W, which varies from utterance to utterance as indicated in the subscripts, Rks. To test the hypothesis that the value of da might depend on the number of accents, we fitted the data with two different models, for each of the four speakers separately. The first model incorporates one Fr value for
all the speaker's accent series and four da values, one for each length. This is indicated in the subscript, dak. The second model incorporates one Fr value and one da value, irrespective of series length, forcing the four daks to have the same value. With this provision, the same mathematical formula can be used for both models. Applying the above to our model's general formula, the predicted value for the nth accent in a series of k accent values thus becomes (11).
(11) Pksn = Fr * Rks^(da^(n-1))
In an optimizing procedure, the parameters to be optimized (here the daks) are assigned initial settings, while for the other parameters in the model (here Fr and the Rks) fixed values are chosen. The model's predictions are then calculated (here the Pksn), as well as a distance measure between observed and predicted data (here the Mksn and Pksn) (see below for the measure used here). Subsequently, this distance measure is minimized by an optimizing procedure. (We used iterative hill-climbing techniques [Whittle 1982].) The dak values that go with this minimum are taken as the optimized values. The (fixed) value of Fr was chosen on the assumption that the range is set by the height of the first accent in a series. We equated Mks1 with Pks1 and, because Pks1 = Fr * Rks, Rks is calculated as Mks1/Fr. Subsequently, we set the daks at the values found in the optimizing procedure and, in a second procedure, optimized the Rks estimates. Our definition of the initial Rks estimates entails a perfect fit for the first accent in each series. We therefore expected this secondary optimizing to result in a (somewhat) closer fit, at least for those series where the first accent is "out of line." All values for parameters and the goodness-of-fit index given in this paper are as calculated after this secondary optimizing. In our distance measure we weighted the P minus M distances with the absolute M values, which gives us a percentual measure of the distance. We further wanted the larger (and thus more serious) percentual deviations to carry more weight, and therefore squared the percentual distances. Thus, the measure (D) used in the optimizing procedure is the sum (over all data points, i.e. with summation over k, s, and n) of the squared percentual distances (12).
(12) D = Σk Σs Σn (Pksn - Mksn)^2 / (Mksn)^2
The same distance measure was used in optimizing the Rks, but applied here to all data points within a series, i.e. with summation over n only.
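The following sketch shows one way such a fitting procedure could be set up. It is not the authors' original program: they used iterative hill-climbing, for which a simple grid search over da stands in here, and the "measured" series in the example are invented.

```python
# Illustrative re-implementation of the fitting logic: predictions follow (11),
# the distance measure follows (12), and a grid search replaces hill-climbing.

def predict(Fr, R, da, k):
    """Predicted H* values for a series of k accents: P_n = Fr * R**(da**(n-1))."""
    return [Fr * R ** (da ** n) for n in range(k)]

def distance(measured_series, Fr, da):
    """Summed squared percentual distance over all series and accent positions."""
    total = 0.0
    for series in measured_series:
        R = series[0] / Fr          # range set by the first accent in the series
        for p, m in zip(predict(Fr, R, da, len(series)), series):
            total += ((p - m) / m) ** 2
    return total

def fit_da(measured_series, Fr, grid=None):
    """Return the da value (0 < da < 1) minimising the distance measure."""
    grid = grid or [i / 100 for i in range(40, 100)]
    return min(grid, key=lambda da: distance(measured_series, Fr, da))

if __name__ == "__main__":
    fake_data = [[220.0, 185.0, 160.0, 142.0],   # one downstepping H* series
                 [240.0, 200.0, 172.0]]          # another, with three accents
    print(fit_da(fake_data, Fr=100.0))
```

The secondary optimisation of the Rks values described above could be added in the same fashion, re-estimating R per series with da held fixed.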
14.3.1.3 Results
To obtain an idea of the best possible fit the model allows, we first ran, for each speaker separately, optimizing analyses for a wide range of Fr values. (Of course, different Fr values lead to different da values because of the
Table 14.1 Optimal combination of Fr and da values for a length-dependent-da model and for a length-independent-da model, and goodness-of-fit indexes, for the four speakers separately (for further details, see text)

                 Length-dependent                              Length-independent
Speaker    Fr     da2     da3     da4     da5     index        Fr     da      index
F1         152    0.62    0.64    0.65    0.68    6.03         156    0.64    6.57
F2         146    0.60    0.60    0.63    0.63    5.66         151    0.60    5.94
M1         36     0.78    0.81    0.83    0.84    9.37         48     0.79    10.80
M2         43     0.77    0.80    0.83    0.84    7.07         56     0.79    9.27
mathematical interdependence of the two.) We thus established the optimal combination of values for Fr and the four daks for what will be termed the length-dependent-da model, and for Fr and da for the length-independent-da model. These values are listed in table 14.1 for the four speakers separately (F1 and F2 are the female, M1 and M2 the male speakers). Although the distance measure is quite effective in the optimization of dak, it is somewhat opaque if one wants to assess how well a model predicts the measured data. We defined our goodness-of-fit index as the greatest difference between the predicted and measured values to be found in any set of 95 percent of these 448 differences. We express this difference as an absolute percentage of the measured data, i.e. as |(P-M)/M|. For instance, a goodness-of-fit index of 8.2 percent indicates that the absolute difference between P and M is larger than 8.2 percent of M for only 22 (or 5 percent) of the 448 data points. These indexes are listed in table 14.1. Although the indexes give a general idea of the model's adequacy, they do not reveal any consistent overshooting or undershooting for particular accent positions. We therefore calculated the mean of the 32 percentual residuals, for each accent position for each length. For both models we observed a slight tendency in the residuals for the last accent in the series with four and five accent values to be negative (meaning the model undershoots these accent positions). This tendency may result from an attenuation of the downstep rate towards the end of the longer series, possibly to ensure that enough space is left for final lowering. However, the effect was very small indeed, and not consistent across speakers. In addition to a gradual attenuation of da, it is conceivable that for larger series of accents speakers use a smaller step down (i.e. a larger da) in order to have enough space for all the H* targets in the downstepping contour. Alternatively, the speakers could increase their range, i.e. start at a higher F0 value. Both alternatives would be instances of what Liberman and Pierre-
Table 14.2 Fr estimates as "endpoint average" and optimal da values for a length-dependent-da model and for a length-independent-da model, and the indexes, for the four speakers separately

                 Length-dependent                              Length-independent
Speaker    Fr     da2     da3     da4     da5     Index        Fr     da      Index
F1         148    0.63    0.65    0.67    0.70    5.96         148    0.68    6.55
F2         154    0.57    0.57    0.59    0.59    6.01         154    0.58    6.10
M1         77     0.61    0.65    0.66    0.66    14.22        77     0.65    14.28
M2         77     0.64    0.69    0.72    0.73    9.60         77     0.71    10.52
humbert (1984: 220) call "soft" preplanning, i.e. behavioral common sense, as opposed to "hard" preplanning, which would involve right-to-left computation of the contour. From the dak values for the length-dependent-da model it is clear that all speakers do to some extent have higher values for da with more targets. However, as was also the case in the data collected by Liberman and Pierrehumbert (1984), this trend is small. Indeed, a comparison of the indexes in table 14.1 shows that the inclusion of a length-dependent da in the model results in only a limited gain. To test whether some speakers adjusted the pitch height of the first accent to the number of accents, we subjected the F0 values for the initial accents to an analysis of variance. The effect of the factor series length was not significant, F(3,496) = 1.36 (p > 0.10). While the values for Fr in table 14.1 are optimal, they are in no way theory-based. In order to assess how realistic it would be to assume a language-specific da, we ran optimizing procedures for both models with an externally motivated estimate of the general pitch-range parameter. To this end, we adopted Ladd's operational definition of Fr as the average of endpoint values of utterance-final falling contours. This corresponds with the mean F0 of the utterance-final plateaus. Table 14.2 gives the results. The indexes appear to be higher than the optimal values. (The one exception must probably be attributed to the procedurally inevitable two-step optimizing.) They are nearly optimal for F1 and F2, somewhat higher for M2, and quite a bit higher for M1. The increase in da values with increasing number of targets appears to be independent of Fr and, again, only a small improvement is obtained with a length-dependent da. Finally, observe that this particular Fr estimate does not result in speaker-independent da values. Since the mathematical interdependence of Fr and da is of a nonlinear nature, a different Fr might give speaker-independent da values. If we take a value of approximately two-thirds of the endpoint average, the da obtained
Table 14.3 Fr estimates as "intonational locus" and optimal da values for a length-dependent-da model and for a length-independent-da model, and the indexes, for the four speakers separately

                 Length-dependent                              Length-independent
Speaker    Fr     da2     da3     da4     da5     Index        Fr     da      Index
F1         100    0.76    0.79    0.81    0.83    7.42         100    0.81    9.09
F2         100    0.72    0.74    0.77    0.79    7.55         100    0.77    9.03
M1         50     0.73    0.77    0.78    0.79    10.21        50     0.78    10.99
M2         50     0.74    0.78    0.81    0.82    7.26         50     0.81    9.21
appears to be speaker-independent. This particular Fr estimate may be viewed as the "intonational locus," i.e. a theoretical target which is never actually reached. The results are given in table 14.3. For M1 and M2 the index values do not differ greatly from the maximally attainable; for F1 and F2 they are somewhat higher. Again we observe a slight length dependence for da. We therefore conclude that (1) accentual downstep can be modeled with a single downstep factor, which is independent of prominence and the number of downstepped accents, but that (2) the answer to the question whether da is a speaker-independent parameter is determined by the way in which the general pitch-range parameter Fr is defined.
14.3.2 The phrasal downstep factor dp
14.3.2.1 Method
Because it was felt that it might be too strenuous for our four speakers to produce the required contours on reiterant speech, the speech material this time consisted of a meaningful sentence. The accented syllables (all part of proper names) had either /e/ or /o/ in the peak to preclude effects of intrinsic vowel pitch, and a sonorant consonant in the onset to preclude pitch perturbations. The sentence was MEREL zag LENIE en REMY, LEO zag NORA en NEELTJE, MARY zag MONA en LEENDERT, en KAREL zag ANNE, and the contour was (H*L !H*L L*H) !(H*L !H*L L*H) !(H*L !H*L L*H) !(H*L !H*L). This contour is a single scaling domain, annotated with the morpheme [DOWNSTEP], so that the four AD's it dominates form a downstepping series. The last AD' was included to finish the contour naturally and to provide measurements of the final plateau to determine the general pitch-range parameter (see 14.3.1.1). The accentually downstepped second H* in the first three AD's allows us to also assess an optimal value for da and thus to
Table 14.4 Values for Fr, dp, da, and the goodness-of-fit index for (a) the optimal parameter combination, (b) the "endpoint average" Fr estimate, and (c) the "intonational locus" Fr estimate, for the four speakers separately

           (a) Optimum                       (b) "Endpoint average"              (c) "Intonational locus"
Speaker    Fr     dp     da     Index       Fr     dp     da     Index          Fr     dp     da     Index
F1         156    0.73   0.35   6.64        135    0.82   0.54   6.91           100    0.89   0.72   7.61
F2         160    0.78   0.29   6.07        150    0.81   0.37   6.16           100    0.90   0.63   7.82
M1         70     0.84   0.61   7.85        74     0.83   0.59   7.79           50     0.89   0.71   8.09
M2         90     0.83   0.09   10.13       65     0.88   0.34   11.13          50     0.90   0.46   12.72
compare the two downstep factors, dp and da. The speakers produced the utterance at four prominence levels with five replications for each level, reading them from a stock of twenty cards with a different random order for each speaker. The same procedure for digitizing, speech-file analysis, and F0 measurement was followed. Since we needed to fit the H* targets in the first three AD's, we collected 120 data points for each speaker. These are referred to as Msmn, M standing for "measured," s indicating the replication (1 ≤ s ≤ 20), m the AD' within the scaling domain (1 ≤ m ≤ 3), and n the target within the AD' (1 ≤ n ≤ 2). With only H* targets (T = +1), N and W merge into a single range parameter R (= N^W). For a given (fixed) value of Fr, the Rs for a particular replication is estimated as Ms11/Fr. Sp = 1 and Sa = 1, so Psmn, the predicted (P) pitch value for the nth H* target in the mth AD' in the sth replication, is given by (13).
(13) Psmn = Fr * Rs^(dp^(m-1) * da^(n-1))
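A small numerical illustration of (13), with invented parameter values (Fr = 100 Hz, R = 2.0, dp = 0.9, da = 0.7; these are not the measured data), shows how the two downstep factors combine: within each AD' the second H* is scaled down via da, while each successive AD' lowers its targets via dp.

```python
# Worked example of (13) with invented values; not the authors' measurements.
Fr, R, dp, da = 100.0, 2.0, 0.9, 0.7

for m in (1, 2, 3):                 # the three AD's that were fitted
    for n in (1, 2):                # first and (accentually downstepped) second H*
        p = Fr * R ** (dp ** (m - 1) * da ** (n - 1))
        print(f"AD' {m}, H* {n}: {p:.1f} Hz")
```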
With Fr and the Rs fixed, we optimized dp and da using the same distance measure as before. In a subsequent procedure, we set dp and da at the optimal values obtained and optimized the Rs. We then calculated the goodness-of-fit index.
14.3.2.2 Results
As before, we ran optimizing procedures for a number of different Fr values to obtain the best possible fit the model allows. We also optimized dp and da for both the "endpoint average" and the "intonational locus" estimates of Fr. Table 14.4 gives the results. The increase in the index values (compared to the best attainable in table 14.4a) is somewhat smaller for the "endpoint average" estimate of Fr than
for the "international locus" estimate. The latter estimate has the advantage that dp is apparently speaker-independent. However, for both estimates the da values vary considerably across speakers. In fact, no Fr estimate led to speaker-independent values for both dp and da, but a combination of the "intonational locus" Fr, a dp of 0.90, and a da of 0.70 (as observed for Fl and Ml) would seem to be a reasonable choice to be used in a computer implementation of our model. For both Fr estimates and for all speakers, dp is larger than da, which supports our view that phrasal downstep entails a smaller step down than accentual downstep. Comparing the da values for the "locus" estimate across the two experiments, we observe that here da is lower (and sometimes considerably so) than in our study of accentual downstep. It would appear that speakers tend to use a smaller da, so a larger step down, if they have to realize accentual downstep in combination with phrasal downstep, possibly in an attempt to keep the two as distinct as possible. We conclude that (1) phrasal downstep can be modeled with a single downstep factor which is independent of prominence, but that (2) the answer to the question whether it is speaker-independent is determined by the way in which the general pitch-range parameter is defined. 14.4 Concluding remarks
In this paper we argued that the intonational phenomena of downstep and reset in Dutch should be modeled as the composite effect of phrasal downstep, the downward scaling of a phrase's (AD') register relative to a preceding AD', and accentual downstep, the downward scaling of H* targets within the AD'. We presented data from four speakers (two male and two female) showing that both downstep factors are independent of prominence and that the accentual downstep factor only marginally depends on the number of (accentually) downstepped H*s. We further demonstrated that the values of both downstep factors depend on the value of the general pitch range parameter Fr and that an "intonational locus" estimate of this parameter gives speaker-independent values for the phrasal downstep factor and for the accentual downstep factor in a single AD'. With more AD's, no single speaker-independent da was found, but a value lower than the one for a single AD' appears to be appropriate. We know of no independent theoretical reasons to choose a particular estimate of general pitch range. For the implementation of the model, we therefore prefer the "intonational locus" estimate of the general pitch-range parameter (i.e. Fr is 100 for women and 50 for men), because this allows us to implement a single speaker-independent value for the phrasal downstep
factor (dp is 0.90) as well as for the accentual downstep factor (da is 0.80 in a single AD' and 0.70 if there are more AD's).
Comments on chapter 14
NINA GRØNNUM (formerly THORSEN)
Introduction
My comments concern only van den Berg, Gussenhoven, and Rietveld's proposed analysis and model of Dutch intonation. As I am not a Dutch speaker, and I do not have first-hand knowledge of data on Dutch intonation, my comments are questions and suggestions which I would like readers and the authors to consider, rather than outright denials of the proposals. Nevertheless, it will be apparent from what follows that I think van den Berg, Gussenhoven, and Rietveld's description obscures the most important fact about accentuation in Dutch, and that it tends to misrepresent the relevant difference between contours in some instances because it disregards linguistic function (in a narrower as well as a wider sense). The purported phonological analysis thus nearly reduces to a phonetic transcription (though a broad one) and not always an adequate one at that, as far as I can judge. To mute a likely protest from the outset: I am not generally against trying to reduce rich phonetic detail to smaller inventories of segments, features or parameters: it is the nature of van den Berg, Gussenhoven, and Rietveld's description and its relevance to a functional description that I question. I begin with some general criticisms which bear upon the concrete examples below. First, it is difficult to evaluate the adequacy of a description which is based on examples, in the literal sense of the word, i.e. sample utterances recorded once, by one speaker. Second, we are led to understand that the various contours accompanying the same utterance are meaningfully different, but we are not told in which way they are different, what kind of difference in meaning is expressed, and whether or not Dutch listeners would agree with the interpretation. Third, I miss an account of the perceived prominence of the accented syllables in the examples, which might have been relevant to the treatment of downstep. And fourth, in that connection I miss some reflections about what accentual and phrasal downstep are for, what functions they serve.
Accentuation
From the publications of Cohen, Collier, and 't Hart (e.g. 't Hart and Cohen 1973; 't Hart and Collier 1975; Collier and 't Hart 1981), I have understood Dutch to epitomize the nature of accentuation: accented syllables are stressed syllables which are affiliated with a pitch change.1 Beyond that - as far as I can see - there are few restrictions, i.e. the change may be in either direction, it may be quick or slow, it may be coordinated with either onset or offset of the stressed syllable, and it may be bidirectional. Not all the logical combinations of directions, timings, and slopes occur, I suppose, but many of them do, as witnessed also by van den Berg, Gussenhoven, and Rietveld's examples. Nor does a pitch change necessarily evoke an accent, as for example when it is associated with unstressed syllables at certain boundaries. From this freedom in the manifestation of accent, a rich variety of patterns across multiaccented phrases and utterances arises.2 I would therefore like to enquire how far one could take a suggestion that the underlying mechanism behind accentuation in Dutch is pitch change, and that the particular choice of how pitch is going to change is a combination of (1) restrictions at phrase and utterance boundaries, connected with utterance function and pragmatics, (2) focus distribution, (3) degree and type of emphasis, (4) syntagmatic restrictions (i.e. certain pitch patterns cannot precede or follow certain other ones if pitch changes are to be effected and boundary conditions met), and (5) speech style, i.e. factors outside the realm of phonology/lexical representation. I realize, of course, that some of these factors (especially speech style and pragmatics) are universally poorly understood and I cannot blame van den Berg, Gussenhoven, and Rietveld for not incorporating them in their model. I do think, however, that they could have exploited to greater advantage the possibility of separating out effects from utterance function and boundary conditions, and they could at least have hinted at the possible effects of the two different elocutionary styles involved in their examples (complete utterances versus lists of place and
1 Here and throughout I will use "pitch" to refer to both F0 and perceived pitch, unless I need to be more specific.
2 Lest it be thought that I am implicitly comparing van den Berg, Gussenhoven, and Rietveld's analysis to Cohen, Collier, and 't Hart's, and preferring the latter: this is not so. Even if the latter's description in terms of movements is phonetically accurate, it would gain from being further reduced to a specification of target values, pinning down the perceived pitch (pitch level) of the accented syllables: many rapid local F0 changes are likely to be produced and perceived according to their onsets or offsets, although we can (be trained to) perceive them as movements, especially when they are isolated. However, Cohen, Collier, and 't Hart's purpose and methodology (to establish melodic identity, similarity, categorization) apparently have not motivated such a reduction. A further difficulty with them is that they have not committed themselves to a specification of the functional/pragmatic circumstances under which specific combinations of accent manifestations are chosen. (Collier [1989] discusses the reasons why.) This latter is, however, equally true of van den Berg, Gussenhoven, and Rietveld.
proper names). As it is, they represent every contour in their examples as a sequence of L*H or H*L pitch accents, with utterance-final L% and H% tones added, with conventions for representing slow and level movements, and a downstep morpheme to characterize series of descending H*s.3
Tonal representation of accents
Let me first question the adequacy of some of the tonal representations that van den Berg, Gussenhoven, and Rietveld assign to accentual pitch movements, and thereby also challenge their claim that their model is "realistic." (By realistic I take them to mean a model which is relevant to speech production and perception, which models both speaker and listener behavior, and one which is not unduly abstract.)
Low-pitched final accents
The final movement in e.g. figures 14.2a and 14.3b is rendered as H*L, i.e. the stressed syllable is associated with a H in both instances. My own experience with listening to speech while looking at F0 traces made me strongly doubt that the two are perceptually equivalent. I would expect the one in figure 14.3b to be perceived with a low-pitched stressed syllable. To check this, I recorded two utterances which were to resemble as closely as possible the one in figure 14.3b, globally and segmentally, but where one was produced and perceived with a high-pitched, the other with a low-pitched stressed syllable at the end. The result is shown in figure 14.10. (Note that neither of these contours is an acceptable Standard Danish contour; intonationally they are nonsense.) The two traces look superficially alike, but the fall is timed differently with respect to the stressed vowel, corresponding to the high (figure 14.10a) and low (figure 14.10b) pitch level of the stressed syllable. Figure 14.10b rather resembles figure 14.3b, from which I infer that it should be given a L* representation. This would introduce a monotonal L* (or a L*L) to the system. The same suspicion attaches to the final !H* in figures 14.5a, b, and 14.7a. I suspect that since van den Berg, Gussenhoven, and Rietveld have decided that utterances can only terminate in H*LL% or L*HH% (or H*LH%), and since figure 14.3b (and figures 14.5, 14.7a) is clearly not L*HH%, it must be H*LL%. I return to this on pages 363-4 below. I am, likewise, very skeptical about the reality of the assignment of !H*L to kleren, schoenen, and school in figure 14.7b. I would expect the perceptually salient part of such steep movements to correspond to the F0 offset rather
3 They do not mention tritonal pitch accents, and in the following I will assume that those H*LHs which appear in Gussenhoven (1988) have been reinterpreted as L*LH%, which is reasonable, considering his examples.
Figure 14.10 Two Dutch intonation contours imposed on a Danish sentence, with a final stressed syllable that sounds high (upper plot) or low (lower plot). The Danish text is Lille Morten vil have mere manna ("Little Morten wants more manna")
than the onset of the stressed vowel. Is that not what led to the assignment of H*(L), rather than L*H, to mooiste, duurste (together with the acknowledgment that the F0 rise is a consequence of the preceding low value on de and of the syllabic structure with a voiced initial consonant and a long vowel, cf. beste in the last phrase)? In other words, if the F0 rises in mooiste, duurste are assigned tones in accordance with the offset of their F0 movements, why are the falls in kleren, schoenen, school not treated in the same way? Even though phonological assignment need not and should not exactly mirror phonetic detail, I think these assignments are contradictory. (An utterance where all the stressed syllables had short vowels in unvoiced obstruent surroundings, like beste, might have been revealing.) What are the criteria according to which, e.g. mooiste and kleren are variants of the same underlying pattern, rather than categorically different?
Accentuation and boundaries
Now let me proceed to selected examples where I think van den Berg, Gussenhoven, and Rietveld's analysis may obscure the real issue. They state that figure 14.2a has an appreciable prosodic boundary. What is so appreciable about it when none is present in the same location in figure 14.1a? Are we to understand that if the last word was spliced off from 14.1a and 14.2a, then on replaying them listeners would hear a prosodic boundary in 14.2a but not in 14.1a? If not, then the boundary is not a prosodic boundary per se, but a "rationalization after the fact" (of the presence of a
succeeding accent), i.e. something which is only appreciated when its conditioning factor is present. Indeed, it may be asked whether Dutch listeners would agree that such a boundary is present even in the full version of figure 14.2a. If not, the boundary is merely a descriptive device, invoked in order to account for the steep fall to L in figure 14.2a versus the slow decline in figure 14.3a. Would it be possible, instead, to consider the quick versus slow fall over -warde wil meer to be a property of the accent itself, connected to emphasis or speech style? (For example, do quick falls lend slightly more prominence to the accent? Are they characteristic of more assertive speech styles?) The boundary assignment in figure 14.2b is much more convincing, since it is associated with an extensive and rapid movement, affiliated with unstressed syllables. With such a boundary in 14.2b, is there any choice at all of pitch accent on Leeuwarden? That is, if a boundary is to be signaled unambiguously prosodically by a pitch change to the succeeding phrase, and if the succeeding phrase is specified to start low (whether by default or by a L%), then the only bitonal pitch accent Leeuwarden can carry is L*H. This is then not a choice between different lexical representations: the L*H manifestation is a choice forced by circumstances elsewhere. Finally, if mannen is more adequately represented as L* in figure 14.3b (see page 361 above), then the preceding unstressed syllable(s) must be higher, since the accent must be signaled by a change to the low man-. (Note that it cannot be signaled by a change to a high man-, since this would violate demands for a prosodically terminal contour.) Thus, again, Leeuwarden cannot be H*L, it must be L*H.
Summary
I fundamentally agree with van den Berg, Gussenhoven, and Rietveld that there is a sameness about the first pitch accent across the (a)s in figures 14.1-3, and likewise across the (b)s. I suggest, however, that this sameness applies to all the accented syllables, and that the different physical manifestations are forced by circumstances beyond the precincts of accentuation, and presumably also by pragmatic and stylistic factors which are not in the realm of phonology at all.
Boundary tones
Van den Berg, Gussenhoven, and Rietveld state that they are uncertain about the occurrence of the L%, but they suppose it to characterize the termination of IPs (identical to utterances, in their examples). I find the concept of boundary tone somewhat suspect, when it can lead them to state
that a T% copies the second T of the last pitch accent (if it is not pushed off the board by the preceding T), except in attested cases of H*LH%. A boundary tone proper should be an independently motivated and specified characteristic, associated with demands for intonation-contour termination connected to utterance function, in a broad syntactic and pragmatic sense. It cannot be dictated by a preceding T*T; on the contrary, it can be conceived of as being able to affect the manifestation of the preceding accent, as suggested above.
Low boundary tone
It may be useful to discuss L% and H% separately, because I think the L% can be done away with entirely. In figure 14.1a the authors admit that the (H*)L and the L% are not clearly different, but the presence of L% can be motivated on the basis of the final lowering. Now, if 14.1a justifies a L%, so does 14.1b - which is counterintuitive and nonsensical. Perhaps both of the slight final falls in 14.1a and b should be ascribed to weak edge vibrations before a pause, and something which might have been left out of consideration with different segmentation criteria. Furthermore, if there is any special final lowering command involved at the end of prosodically terminal utterances in Dutch, I would expect final falls from high to low to be of greater extent than nonfinal falls from high to low - and there is not a single such instance in the examples. For comparison, figure 14.11 shows some cases in which "final lowering" seems a meaningful description. I suggest that low termination in Dutch be considered the unmarked or default case, which characterizes prosodically terminal utterances. It does not require a special representation, but can be stated in terms of restrictions on the manifestation of the last pitch accent, H*L or L*. In the same way, low unstressed onset of phrases and utterances can also be considered unmarked.
High boundary tone
High termination seems to require a H%, and is not solely a matter of manifestation of the last pitch accent. This can be seen in figure 14.1b, where a rise is superposed on the relatively high level stretch, and in the attested cases of H*LH% mentioned by van den Berg, Gussenhoven, and Rietveld. But why should H% a priori be restricted to utterance boundaries? In this connection, compare the magnitude of the L*H interval in 14.1b to the rise in Leo, and Mary in figure 14.7a - where it seems that a H% might be necessary to account for the higher rises.
Figure 14.11 Three contours illustrating "final lowering" in the sense introduced in the text. In each case the final fall from High to Low spans a considerably wider pitch range than the preceding non-final falls. The top two contours are from regional varieties of standard Danish, while the bottom contour is (northern) standard German; more detail can be found in Grønnum (forthcoming), from which the figures are taken
Downstep
Accentual downstep
Could the lack of downstep in figures 14.2a and 14.3a be due to the second accented syllable being produced and perceived as more prominent than the first one? (See Gussenhoven and Rietveld's [1988] own experiment on the perception of prominence of later accents with identical F0 to earlier ones.) If uneven prominence does indeed account for an apparent lack of downstep within prosodic phrases, there would be no need to introduce a downstep morpheme; successive lowering of H*s would characterize any succession of evenly prominent H*s in a phrase. The disruption of downstep would be a matter of pragmatic and speech style effects (which I suspect to be at work in figure 14.7b), not of morphology. In my opinion, the authors owe the reader an account of the meaning of [!]: when does it and when does it not appear? Van den Berg, Gussenhoven, and Rietveld find no evidence of downstepping L*s. In fact, it appears from their examples that low turning points, be they associated with stressed syllables or not, are constantly low during the utterance, and against this low background the highs are thrown in relief and develop their downward trend. This makes sense if the Ls are produced so as to form a reference line against which the upper line (smooth or bumpy as the case may be) can be evaluated - which is the concept of Garding's (1983) model for Swedish by whom van den Berg, Gussenhoven, and Rietveld claim to have been inspired. With such a view of the function of low tones, one would not expect L*s to downstep, and one would look for the manifestation of unequal prominence among L*s in the interval to a preceding or succeeding H.
Phrasal downstep
In the discussion of how to conceive of and model resets, I think the various options could have been seen in the light of the function resetting may serve. If it is there for purely phonetic reasons, i.e. to avoid going below the bottom of the speaking range in a long series of downstepping H*s, then it is reasonable to expect that H*s after the first reset one would also be higher. If resetting is there for syntactic or pragmatic reasons - to signal a boundary - it would be logically necessary only to do something about the first succeeding H*. However, I imagine that if consecutive H*s did not accompany the reset one upwards, then the reset one would be perceived as being relatively more prominent. To avoid that, the only way to handle phrasing in this context is to adjust all the H*s in the unit. Yet it is doubtful whether the phenomenon is properly labeled a shift of register, since the L*s do not appear to be affected (cf. the phrase terminations in figures 14.6 and 14.7a). It seems to be the upper lines only which are shifted.
Otherwise, I entirely endorse van den Berg, Gussenhoven, and Rietveld's concept of a "wheels-within-wheels" model where phrases relate to each other as wholes and to the utterance, because it brings hierarchies and subordination back into the representation of intonation. In fact, their "lowering model" (i.e. one which lowers successive phrase-initial pitch accents [to which accentual downstep then refers], rather than one which treats phrasal onsets as a step up from the last accent in the preceding phrase) is exactly what I suggested a "tone sequence" representation of intonation would require in order to accommodate the Danish data on sentences in a text: "Consecutive lowering of individual clauses/sentences could be handled by a rule which downsteps the first pitch accent in each component relative to the first pitch accent in the preceding one" (Thorsen 1984b: 307). However, data from earlier studies (Thorsen 1980b, 1981, 1983) make it abundantly clear that the facts of accentual downstep within individual phrases in an utterance or text are not as simple as van den Berg, Gussenhoven, and Rietveld represent them.
Conclusion
I have been suggesting that perhaps Dutch should not be treated along the same lines as tone languages or word-accent languages, or even like English. Perhaps the theoretical framework does not easily accommodate Dutch data; forcing Dutch intonation into a description in terms of sequences of categorically different, noninteracting pitch accents can be done only at the expense of phonetic (speaker/listener) reality. But even accepting van den Berg, Gussenhoven, and Rietveld's premise, I think their particular solution can be questioned on the grounds that it brings into the realm of prosodic phonology phenomena that should properly be treated as modifications due to functions at other levels of description.
Modeling syntactic effects on downstep in Japanese
HARUO KUBOZONO
15.1 Introduction
One of the most significant findings about Japanese intonation in the past decade or so has been the existence of downstep*. At least since the 1960s, the most widely accepted view had been that pitch downtrend is essentially a phonetic process which occurs as a function of time, more or less independently of the linguistic structure of utterances (see e.g. Fujisaki and Sudo 1971a). Against this view, Poser (1984) showed that downtrend in Japanese is primarily due to a downward pitch register shift ("catathesis" or "downstep"), which is triggered by (lexically given) accents of minor intonational phrases, and which occurs iteratively within the larger domain of the socalled major phrase. The validity of this phonological account of downtrend has subsequently been confirmed by Beckman and Pierrehumbert (Beckman and Pierrehumbert 1986; Pierrehumbert and Beckman 1988) and myself (Kubozono 1988a, 1989).1 Consider first the pair of examples in (1). (1) a. b.
uma'i nomi'mono "tasty drink" amai nomi'mono "sweet drink"
The phrase in (la) consists of two lexically accented words while (lb) T h e research reported on in this paper was supported in part by a research grant from the Japanese Ministry of Education, Science, and Culture (no. 01642004), Nanzan University Pache Research Grant IA (1989) and travel grants from the Japan Foundation and the Daiko Foundation. I am grateful to the participants in the Second Conference on Laboratory Phonology, especially the discussants, whose valuable comments led to the improvement of this paper. Responsibility for the views expressed is, of course, mine alone. 1 It must be added in fairness to Fujisaki and his colleagues that they now account for phonemena analogous to downstep by positing an "accent-level rule," which resembles McCawley's (1968) "accent reduction rule" (see Hirose et al. 1984; Hirose, Fujisaki, and Kawai 1985). 368
75 Haruo Kubozono
consists of an unaccented word (of which Tokyo Japanese has many) and an accented word. Downstep in Japanese looks like figures 15.1a and 15.2 (solid line), where an accented phrase causes the lowering of pitch register for subsequent phrases, accented and unaccented alike, in comparison with the sequences in which the first phrase is unaccented (i.e. figures 15.1b and 15.2, dotted line). The effect of downstep can also be seen from figure 15.3, which shows the peak values of the second phrase as a function of those of the first. The reader is referred to Kubozono (1989) for an account of the difference in the height of the first phrases. Downstep thus defined is a rather general intonational process in Japanese, where such syntactic information as category labels are essentially irrelevant. Previous studies of downstep in Japanese have concentrated on the scaling of the peak values of phrases, or the values of High tones, which constitute the ceiling of the pitch register. There is, on the other hand, relatively little work on the scaling of valley values or Low tones, which supposedly constitute the bottom of the register. This previous approach to the modeling of downstep can be justified in most part by the fact, reported by Kubozono (1988a), that the values for Low tones vary considerably depending on factors other than accentual structure. Downstep seems to play at least two roles in the prosodic structure of Japanese. It has a "passive" function, so to speak, whereby its absence signals a sort of pragmatic "break" in the stream of a linguistic message (see Pierrehumbert and Beckman 1988; Fujimura 1989a). A second and more active role which this intonational process plays is to signal differences in syntactic structure by occurring to different degrees in different syntactic configurations. As we shall see below, however, there is substantial disagreement in the literature as to the details of this syntax-prosody interaction, and it now seems desirable to consider more experimental evidence and to characterize this particular instance of syntax-prosody interaction in the whole prosodic system of Japanese. With this background, this paper addresses the following two related topics: one is an empirical question of how the downward pitch shift is conditioned by syntactic structure, while the other is a theoretical question of how this interaction can or should be represented in the intonational model of the language. 15.2 Syntactic effects on downstep
In discussing the interaction between syntax and phonology in Japanese, it is important to note the marked behavior of right-branching structure. In syntactic terms, Japanese is a "left-branching language" (Kuno 1973), which takes the left-branching structure (as against the right-branching structure) in various syntactic constructions such as relative-clause constructions. It has 369
Prosody
180 " Pi 160 •
140 -
P2
* 1
v2
120 -
v3
100 -
(*)
1
•«-/
180 •
160 140 •
P2
Pi
• V2
Vi
v3
120 100 -
Figure 15.1 (a) A typical pitch contour for phrase (la), pronounced in the frame sorewa ... desu ("It is ..."); (b) A typical pitch contour for phrase (lb), pronounced in the frame sorewa ... desu ("It is ...")
been made clear recently that the right-branching structure is marked in phonology as well, in that it blocks prosodic rules, both lexical and phrasal. This is illustrated in (2) where the right-branching structure blocks the application of prosodic rules which would otherwise unify two (or more) syntactic/morphological units into one prosodic unit.2 2
Interestingly, Clements (1978) reports that prosodic rules in Anlo Ewe and in Italian are also sensitive to right-branching structure (which he defines by the syntactic notion "left-branch"). Selkirk and Tateishi (1988a, b) propose a slightly different generalization, which will be discussed below in relation to downstep. 370
15 Haruo Kubozono (Hz) 160
150 140 130 120 110
v2
Pi
p2
v3
Figure 15.2 Schematic comparison of the pitch patterns of (la)-type utterances (solid line) and (lb)-type utterances (dotted line), plotted on the basis of averaged values at peaks and valleys
(Hz) [Pd 170
160
150
140
140
150
160
170
(Hz) [Pi]
Figure 15.3 Distribution of PI and P2 in figure 15.2: P2 values as a function of PI values in utterances of type (la) (circles) and type (lb) (squares) 371
Prosody (2) a. b. c. d. e.
[A B] -> AB [[A B] C] -> ABC [A [B C]] -> A/BC [[A [B C]] D] -> A/BCD [A [B [C D]]] -> A/B/CD
Examples of such prosodic rules include (a) lexical rules like compound formation ("compound accent rules"; see Kubozono 1988a, b) and sequential voicing rules also characteristic of the compound process (Sato n.d.; Otsu 1980), and (b) phrasal rules such as the intonational phrasing process generally known as "minor phrase formation" (see Fujisaki and Sudo 1971a; Kubozono 1988a). Given that Japanese shows a left-right asymmetry in these prosodic processes, it is natural to suspect that the right-branching structure shows a similar marked behavior in downstep as well. The literature, however, is divided on this matter. Poser (1984), Pierrehumbert and Beckman (1988), and Kubozono (1988a, 1989), on the one hand, analyzed many three-phrase right-branching utterances and concluded that the register shift occurs between the first two phrases as readily as it occurs between the second and third phrases in this sentence type: the second phrase is "phonologically downstepped" as relative to the first phrase in the sense that it is realized at a lower level when preceded by an accented phrase than when preceded by an unaccented phrase. On the other hand, there is a second group who claim that downstep is blocked between the first two phrases in (at least some type of) the rightbranching structure. Fujisaki (see Hirose, Fujisaki, and Kawai 1985), for instance, claims that the "accent level rule," equivalent to our downstep rule (see note 1), is generally blocked between the first two phrases of the rightbranching utterances. Fujisaki's interpretation is obviously based on the observation that three-phrase right-branching utterances generally show a markedly higher pitch than left-branching counterparts in the second component phrase, a difference which Fujisaki attributes to the difference in the occurrence or absence of downstep in the relevant position (see figure 15.4 below). Selkirk and Tateishi (1988a, b) outline a similar view, although they differ from Fujisaki in assuming that downstep is blocked only in some type of right-branching utterances. They report that downstep is blocked in the sequences of [Noun-«o [Noun-H0 Noun]] (where no is a relativizer or genitive particle) as against those of [Adjective [Adjective Noun]], taking this as evidence for the notion of "maximal projection" in preference to the generalization based on branching structure (see Selkirk 1986 for detailed discussion). Selkirk and Tateishi further take this as evidence that the "major phrase," the larger intonational phrase defined as the domain of downstep, can be defined in a general form by this new concept. 372
With a view to reconciling these apparently conflicting views in the literature, I conducted experiments in which I analyzed utterances of the two different syntactic structures made by two speakers of Tokyo Japanese (see Kubozono 1988a and 1989 for the details of the experimental design and statistical interpretations). In these experiments various linguistic factors such as the accentual properties and phonological length of component elements were carefully controlled, as illustrated in (3); the test sentences included the two types of right-branching utterances such as those in (4) and (5) which, according to Selkirk and Tateishi, show a difference in downstep.

(3) a. [[ao'yama-ni a'ru] daigaku]
       Aoyama-in exist university
       "a university in Aoyama"
    b. [ao'yama-no [a'ru daigaku]]
       Aoyama-Gen certain university
       "a certain university in Aoyama"
(4) a. [a'ni-no [me'n-no eri'maki]]
       brother-Gen cotton-Gen muffler
       "(my) brother's cotton muffler"
    b. [ane-no [me'n-no eri'maki]]
       "(my) sister's cotton muffler"
(5) a. [ao'i [o'okina eri'maki]]
       "a blue, big muffler"
    b. [akai [o'okina eri'maki]]
       "a red, big muffler"
The results obtained from these experiments can be summarized in three points. First, they confirmed the view that the two types of branching structure exhibit distinct downtrend patterns, with the right-branching utterances generally showing a higher pitch for their second elements than the left-branching counterparts, as illustrated in figure 15.4 (see note 3). The distribution of the values for Peak 2 (P2), shown in figure 15.5, gives no indication that the averaged pitch values for this parameter represent two distinct subgroups of tokens for either of the two types of branching structure (see Beckman and Pierrehumbert, this volume). Second, it was also observed that the pitch difference very often propagates to the third phrase, as illustrated in figure 15.6, suggesting that the difference in the second phrases represents a difference in the height of pitch register, not a difference in local prominence. Third and most important, the experimental results also confirmed the observation previously made by Poser, Pierrehumbert and Beckman, and
3 Analysis of the temporal structure of the two patterns has revealed no significant difference, suggesting that the difference in the pitch height of the second phrase is the primary prosodic cue to the structural difference.
Figure 15.4 Schematic comparison of the pitch patterns of (3a) (dotted line) and (3b) (solid line), plotted on the basis of averaged values at peaks and valleys
Figure 15.5 Distribution of P1 and P2 in figure 15.4: P2 values as a function of P1 values in utterances of (3a) (squares) and (3b) (circles)
myself that the right-branching structure (as well as the left-branching structure) undergoes downstep in the phonological sense. This can be seen from a comparison of the two sentences in each of the pairs in (4) and (5) with
Figure 15.6 Distribution of P2 and P3 in figure 15.4: P3 values as a function of P2 values in utterances of (3a) (squares) and (3b) (circles)
respect to the height of the second component phrases. Figures 15.7 and 15.8 show such a comparison of the pairs in (4) and (5) respectively, where the peak values of the second phrase are plotted as a function of those of the first phrase. The results in these figures reveal that the second phrase is realized at a lower pitch level when preceded by an accented phrase than when preceded by an unaccented phrase.

Noteworthy in this respect is the fact that the two types of right-branching constructions discriminated by Selkirk and Tateishi are equally subject to downstep and, moreover, show no substantial difference from each other in downstep configuration. In fact, the effect of downstep was observed between the first two phrases of right-branching utterances irrespective of whether the component phrases were a simple adjective or a Noun-no sequence, suggesting that, at least as far as the results of my experiments show, it is the notion of branching structure and not that of maximal projection that leads to a linguistically significant generalization concerning the occurrence of downstep; in the absence of relevant data and information, it is not clear where the difference between Selkirk and Tateishi's experimental results and mine concerning the interaction between syntax and downstep comes from - it may well be attributable to the factor of speaker strategies discussed by Beckman and Pierrehumbert (this volume).

The observation that there are two distinct downstep patterns and that they can be distinguished in terms of the branching structure of utterances is
Figure 15.7 Distribution of P1 and P2 for the two sentences in (4): P2 values as a function of P1 values in utterances of (4a) (circles) and (4b) (squares)
Figure 15.8 Distribution of P1 and P2 for the two sentences in (5): P2 values as a function of P1 values in utterances of (5a) (circles) and (5b) (squares)
further borne out by experiments in which longer stretches of utterances were analyzed. Particularly important are the results of the experiments in which two sets of four-phrase sentences, given in (6) and (7), were analyzed. Each set consists of three types of sentence all involving an identical syntactic structure (symmetrically branching structure) but differing in the accentedness of their first and/or second elements.

(6) a. [[na'oko-no a'ni-no][ao'i eri'maki]] "Naoko's brother's blue muffler"
    b. [[na'oko-no ane-no][ao'i eri'maki]] "Naoko's sister's blue muffler"
    c. [[naomi-no ane-no][ao'i eri'maki]] "Naomi's sister's blue muffler"
(7) a. [[na'oko-no a'ni-wa][ro'ndon-ni imasu]] "Naoko's brother is in London"
    b. [[na'oko-no ane-wa][ro'ndon-ni imasu]] "Naoko's sister is in London"
    c. [[naomi-no ane-wa][ro'ndon-ni imasu]] "Naomi's sister is in London"

Utterances of the first type, (6a) and (7a), typically show a pitch pattern like that in figure 15.9, in which the peak of the third phrase is usually higher than that of the second phrase. A glance at this pattern alone suggests that downstep is blocked between these two elements and that some sort of prosodic boundary must be posited in this position. Comparison of this Fo pattern with those of other accentual types, however, suggests that this interpretation cannot be justified.

The relationship between the height of the third phrase and the accentedness of the preceding phrases is summarized in figure 15.10, in which the peak values of the third phrase are plotted as a function of the peak values of the first phrase. This figure shows that in each of the three cases considered, the peak values for the third phrase are basically linear functions of those for the first phrase, distributed along separable regression lines. What this figure suggests is that the third component phrase is lowered in proportion to the number of accents in the preceding context, with the phrase realized lower when preceded by one accent than when preceded by no accent, and still lower when preceded by two accents. Leaving aside for the moment the fact that the third phrase is realized at a higher level than the second phrase, the result in this figure suggests that downstep has occurred iteratively in the utterances of (6a)/(7a) type without being blocked by the right-branching structure involved. This, in turn, suggests that occurrence of downstep cannot be determined by comparing the relative height of two successive minor phrases observed at the phonetic output (see Beckman and Pierrehumbert, this volume).
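The iterative, relational character of downstep presupposed by this argument can be illustrated with a small Python sketch. It is only an illustration: the function name, the starting register of 180 Hz, and the downstep factor of 0.8 are arbitrary expository choices, not parameters estimated from the present data.

    def schematic_peaks(accented, top=180.0, downstep=0.8):
        """Schematic peak values (Hz) for a string of minor phrases.
        `accented` lists, for each phrase, whether it bears a lexical accent."""
        register = top
        peaks = []
        for has_accent in accented:
            peaks.append(round(register, 1))
            if has_accent:             # an accented phrase triggers downstep,
                register *= downstep   # lowering the register for what follows
            # an unaccented phrase leaves the register unchanged
        return peaks

    # Third-phrase peaks for the three accentual types in (6):
    print(schematic_peaks([True, True, True, True])[2])    # (6a): 115.2 (two preceding accents)
    print(schematic_peaks([True, False, True, True])[2])   # (6b): 144.0 (one preceding accent)
    print(schematic_peaks([False, False, True, True])[2])  # (6c): 180.0 (no preceding accent)

On this view each value is computed from the previous register setting, which is why the mere fact that one peak is lower than an earlier one does not by itself establish that downstep has applied. The sketch deliberately leaves out the upward register shift discussed in section 15.3 below, which is what raises the downstepped third phrase above the second in (6a)/(7a)-type utterances.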
Figure 15.9 Schematic pitch contour of (6a)-type utterances, plotted on the basis of averaged values at peaks and valleys
Figure 15.10 Distribution of P1 and P3 for the three sentences in (6): P3 values as a function of P1 values in utterances of (6a) (circles), (6b) (squares) and (6c) (crosses)
To sum up the experimental evidence presented so far, it can be concluded that downstep occurs irrespective of the branching structure involved, but that it occurs to a lesser extent between two elements involving a right-branching structure than between those involving a left-branching structure. Viewed differently, this means that downstep serves to disambiguate differences in branching structure by the magnitudes in which it occurs. Seen in the light of syntax-prosody mapping, it follows from these consequences that occurrence or absence of downstep cannot be determined by the branching structure of utterances, as supposed by Fujisaki, or in terms of the notion of "maximal projection," as proposed by Selkirk and Tateishi. It follows, accordingly, that the "major phrase" cannot be defined by syntactic structure in a straightforward manner; rather, that the mapping of syntactic structure onto prosodic structure is more complicated than has been previously assumed (see the commentary to chapters 3 and 4 by Vogel, this volume).

15.3 Modeling syntax-downstep interaction
Given that left-branching and right-branching structures show different configurations of downstep, there are two approaches to accounting for this fact. One is to assume two types of downstep which occur with different magnitudes, one applying over right-branching nodes and the other applying elsewhere. The other possibility is to assume just one type of downstep and attribute the syntactically induced difference in question to a phonetic realization rule independent of downstep. Leaving aside for the moment the first approach (which is fully described by van den Berg, Gussenhoven, and Rietveld, this volume), let us explore the second approach in detail here and consider its implications for the modeling of Japanese intonation.

15.3.1 Metrical boost

To account for the observed difference in downstep pattern, I proposed the concept of "metrical boost" (MB), an upstep mechanism controlling the upward shifting of pitch register, which superimposes a global pitch boost in the right-branching structure onto the otherwise identical downstep pattern (i.e. the gradually declining staircase defined by the phonological structure). The effect of this upstep mechanism is illustrated in figure 15.11; see Kubozono (1989) concerning the question of whether the effects of multiple boosts are cumulative. This analysis is capable of accounting in a straightforward manner for the paradoxical case discussed in relation to figure 15.9, the case where a downstepped phrase is realized at a higher pitch level than the phrase which
Figure 15.11 Effect of metrical boost in (3b)-type utterances: basic downstep pattern (dotted line) and downstep pattern on which the effect of metrical boost is superimposed (solid line)
triggers the downward register shift. Under this analysis, it can be understood that the downstepped (i.e. third) phrase has been raised by the phonetic realization rule of metrical boost to such an extent that it is now realized higher than the second minor phrase (fig. 15.12). This case is a typical example where the syntactically induced pitch boost modifies the phonologically defined downstep pattern. Syntactically more complex utterances show further complex patterns in downstep, as shown in Kubozono (1988a), but all these patterns can be described as interactions of downstep and metrical boost, the two rules which control the shifting of pitch register in two opposite directions.

The notion underlying the rule of metrical boost is supported by yet another piece of evidence. In previous studies of Japanese intonation, it is reported that sudden pitch rises occur at major syntactic boundaries such as sentence, clause, and phrase boundaries. These "juncture phenomena" have been explained by way of the "resetting" of pitch register or other analogous machinery in intonational models (see Han 1962; Hakoda and Sato 1980; Uyeno et al. 1981; Hirose et al. 1984). In the sentences given in (8), for example, remarkable degrees of pitch rise reportedly occur at the beginning of the fourth phrase in (8a) and the second phrases in (8b) and (8c).
Figure 15.12 Effect of metrical boost in (6a)/(7a)-type utterances: basic downstep pattern (dotted line) and downstep pattern on which the effect of metrical boost is superimposed (solid line)
(8) a. [[A [B C]][[D E][F G]]]
       [[bo'bu-wa [ro'ndon-ni ite]][[sono a'ni-wa][njuuyo'oku-ni imasu]]]
       "Bob is in London, and his brother is in New York"
    b. [A [[[B [C D]] E] F]]
       [kino'o [[[ro'ndon-de [hito-o korosita]] otoko'-ga] tukama'tta]]
       yesterday London-in person-Obj killed man-Nom was caught
       "A man who killed a person in London was caught yesterday"
    c. [A [[B C] D]]
       [zjo'n-to [[bo'bu-no imooto-no] me'arii]]
       John-and Bob-Gen sister-Gen Mary
       "John and Mary, Bob's sister"
If these phenomena are analyzed in terms of the branching structure of the utterances, it will be clear that the sudden pitch rises occur where right-branching structure is defined in a most multiple fashion, that is, at the beginning of the element immediately preceded by double or triple "left branches" (see Clements 1978). If metrical boost is redefined as a general upstep process, in fact, all these so-called juncture phenomena can now be treated as different manifestations of the single rule of metrical boost. In other words, it can be said that pitch register is raised at "major syntactic boundaries" not directly because of the syntactic boundaries but because right-branching structure is somehow involved.
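A hedged sketch of this interpretation, extending the downstep fragment given earlier in the same deliberately simplified terms (the names and factor values are again arbitrary, and the treatment of every "[" before a non-initial phrase as one step of boost anticipates the n-ary characterization argued for below), is the following Python fragment:

    import re

    def schematic_contour(bracketed, top=180.0, downstep=0.8, boost=1.15):
        """Schematic peak values for a bracketed string of minor phrases.
        An apostrophe marks an accented phrase; each accented phrase downsteps
        the register, and each "[" immediately preceding a non-initial phrase
        contributes one step of metrical boost."""
        tokens = re.findall(r"\[|\]|[^\[\]\s]+", bracketed)
        register, peaks, opens, first = top, [], 0, True
        for tok in tokens:
            if tok == "[":
                opens += 1
            elif tok == "]":
                pass                               # closing brackets are ignored
            else:                                  # a minor phrase
                if not first:
                    register *= boost ** opens     # boost scaled by the number of left branches
                peaks.append(round(register, 1))
                if "'" in tok:                     # accented phrase triggers downstep
                    register *= downstep
                opens, first = 0, False
        return peaks

    # Left-branching (3a) versus right-branching (3b): only the latter is boosted
    # at its second phrase, giving the higher P2 (and P3) of figures 15.4-15.6.
    print(schematic_contour("[[ao'yama-ni a'ru] daigaku]"))   # [180.0, 144.0, 115.2]
    print(schematic_contour("[ao'yama-no [a'ru daigaku]]"))   # [180.0, 165.6, 132.5]

Because the boost is keyed to the bracketing itself, the same fragment yields a larger upstep where a phrase is preceded by two or three left branches, in line with the depth effect discussed in the next paragraphs; whether such multiple boosts are in fact cumulative is the empirical question noted above in connection with Kubozono (1989).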
According to my previous experiments (Kubozono 1988a), moreover, there is evidence that the magnitudes of pitch rises (or, to be exact, of upward pitch register shifts) at each syntactic boundary can largely be predicted by the depth of right-branching structure. Consider the three sentences in (9), for example, all of which involve the marked right-branching structure (i.e. left branches in the bracket notation employed here) between the first and second elements. These sentences show different degrees of upstep at the beginning of the second phrases, with (9a) exhibiting a considerably greater effect than the other two cases.
(9) a. [A [[B C] D]]
       [ao'i [[na'oko-ga a'nda] eri'maki]] "blue muffler which Naoko knitted"
    b. [[A [B C]] D]
       [[na'oko-no [ao'i eri'maki-no]] nedan] "Naoko's blue muffler's price"
    c. [A [B [C D]]]
       [na'oko-no [ao'i [o'okina eri'maki]]] "Naoko's blue big muffler"
In this light, it can be understood that the reported tendency of sentence boundaries to induce a greater degree of upstep than clause or phrase boundaries is simply attributable to the fact that sentence boundaries often, if not always, involve a greater depth of right-branching structure than other types of syntactic boundaries. This line of generalization points to a certain difference between the two types of pitch register shift: downstep, or the downward register shift, is a binary process, whereas the upstep mechanism of metrical boost is an n-ary process. This characterization of metrical boost is worth special attention because it enables us to eliminate the conventional rule of pitch register reset from the intonational system of Japanese. Moreover, it enables us to make a generalization as to the linguistic functions of pitch register shifts in Japanese in such a way that lexical information (word accent) and phrasal information (syntactic constituency) are complementary in the use of pitch features.

15.3.2 Implications for intonational representation

Having understood that the upstep rule of metrical boost can be well supported in the intonational system of Japanese, let us finally consider the implications of this analysis for the organization of intonational representation and the interaction between syntax, phonology, and phonetics in the language. The most significant implication that emerges is that the phonetic (realization) rule of metrical boost requires information concerning the hierarchy of
syntactic structure. Given the orthodox view that intonational representation is the only source of information available for intonational (i.e. phonetic realization) rules to derive correct surface pitch patterns, it follows that the intonational structure itself is hierarchically organized, at least to such an extent that left-branching and right-branching structures can be readily differentiated in the domain where downstep is defined. If we define the major phrase (MP) as the domain where downstep occurs between adjacent minor phrases (m.p.), the conventional intonational model can be illustrated in (10). Obviously, this model is incapable of describing the kind of syntactic information under consideration.
(10) [Utterance [MP m.p. m.p. m.p.]]
There seem to be two possible solutions to this problem. One is to revise this intonational representation slightly and to posit an intermediate intonational phrase or level (represented as "IP" here) between the minor phrase and the major phrase.4 In fact, this is the only possible revision that can be made of the conventional representation insofar as we take the position that intonational structure involves an n-ary branching structure. Alternatively, it is also possible to take a substantially different approach to intonational phrasing by assuming that intonational structure is binary branching. Under the first approach, the difference between right-branching and left-branching syntactic structures is represented as a difference in the number of intermediate intonational phrases involved, as shown in (11):
(11) a. [Utterance [MP [IP m.p. m.p. m.p.]]]
     b. [Utterance [MP [IP m.p.] [IP m.p. m.p.]]]
     c. [Utterance [MP [IP m.p. m.p.] [IP m.p. m.p.]]]
4 "IP" must not be confused with the "intermediate phrase" posited by Beckman and Pierrehumbert, which they define as the domain of downstep and hence corresponds to our "major phrase."
(11a) and (11b) are the representations assigned to the two types of three-phrase sentence in (3a) and (3b) respectively, while (11c) is the representation assigned to the symmetrically branching four-phrase sentences in (6a)/(7a). Given the representations in (11), it is possible to account for the syntactic effect on downstep in two different ways: either we assume that the rule of metrical boost raises pitch register at the beginning of IPs that do not begin a MP, or we assume, as proposed by van den Berg, Gussenhoven, and Rietveld (this volume), that downstep occurs to varying degrees depending upon whether it occurs within an IP or over two IPs, to a lesser degree in the latter case than in the former.

Of these two analyses based on the model in (11), the first analysis falls into several difficulties. One of them is that the motivation for positing the new intonational phrase (level) lies solely in accounting for the syntactic effect upon downstep. Moreover, if this analysis is applied to syntactically more complex sequences of phrases, it may indeed eventually end up assuming more than one such additional phrase between the minor phrase and the MP. If this syntactic effect on downstep can be handled by some other independently motivated mechanism, it would be desirable to do without any additional phrase or level.

A second and more serious problem arises from the characterization of IP as the trigger of metrical boost. It has been argued that metrical boost is a general principle of register shift whose effects can be defined on an n-ary and not binary basis. If we define occurrence of metrical boost with reference to "IP," we would be obliged to posit more than one mechanism for upward pitch register shifts: that of MB, which applies within the major phrase, and that of the conventional reset rule, which applies at the beginning of every utterance-internal MP. This is illustrated in a hypothetical intonational representation in (12).
(12) [Utterance [MP m.p. m.p. (↑MB) m.p. m.p.] (↑Reset) [MP m.p. m.p. (↑MB) m.p. m.p.]]
Again, it would be desirable to posit a single mechanism rather than two if the consequences in both cases are the same. Thus, addition of a third intonational level fails to account for the n-ary nature of the upstep mechanism, thereby failing to define it as a general principle controlling upward register shift in Japanese. Correspondingly, this second problem gives rise to a third
problem with the revised representation. That is, this analysis assumes two types of intermediate phrase: MP-initial IPs, which do not trigger the boost, and MP-internal IPs, which do trigger it.

Similarly, the analysis proposed by van den Berg, Gussenhoven, and Rietveld (this volume) poses several difficult problems. While this analysis may dispense with the principle of metrical boost as far as downstep is concerned, it falls into the same difficulties just pointed out. Specifically, it fails to capture the general nature of the upstep principle, which can be defined on the basis of syntactic constituency, requiring us instead to posit either the conventional rule of register reset or a third type of downstep in order to account for the upward register shift occurring at the beginning of every utterance-internal major phrase (see (12)).

If the revised model illustrated in (11) is disfavored because of these problems, the only way to represent the difference of syntactic structure in intonational representation will be to take a substantially different approach to intonational phrasing by abandoning the generally accepted hypothesis that intonational structure is n-ary branching. Noteworthy in this regard is the recursive model proposed by Ladd (1986a), in which right-branching and left-branching structures can be differentiated in a straightforward manner by a binary branching recursive mechanism. Under this approach, the two types of pitch pattern in figure 15.4 can be represented as in (13a) and (13b) respectively, and the four-phrase pattern in figure 15.9 as in (13c).
(13) a. [Utterance [MP [m.p. m.p.] m.p.]]
     b. [Utterance [MP m.p. [m.p. m.p.]]]          (upstep applies to the right branch)
     c. [Utterance [MP [m.p. m.p.] [m.p. m.p.]]]   (upstep applies to the second branch)
Interpreting these intonational representations at the phonetic level, the rule of metrical boost raises pitch register in response to any right-branching structure defined in the hierarchical representation. The recursive intonational structure represented in this way is very directly related to syntactic constituency, and is different from the structure proposed by Poser and by Pierrehumbert and Beckman, which is more or less distinct from syntactic structure and reflects syntactic structure only indirectly.
This new approach can solve all the problems with the conventional approach. That is, it is not necessary to posit any additional intonational level/phrase which lacks an independent motivation. Nor is it necessary to postulate more than one mechanism for upstep phenomena because occurrence and magnitudes of upsteps can be determined by the prosodic constituency rather than by intonational category labels, as illustrated in (14).

(14) [recursive binary-branching representation of the Utterance; diagram not reproduced]
15.4 Concluding remarks
The foregoing discussions can be summarized in the following three points. First, as for the empirical question of how the intonational process of downstep interacts with syntactic structure, experimental evidence suggests that the register shift occurs in both of the two types of branching structure, left-branching and right-branching, and yet disambiguates them by means of the different degrees to which it occurs.

Second, the effects of syntax on downstep patterns can be modeled by the phonetic realization rule termed "metrical boost," which triggers a step-like change in the pitch register for the rest of the utterance, working something like the mechanism of downstep but in the opposite direction. Moreover, by defining this rule as having an n-ary effect whose degrees are determined by the depth of right-branching structure involved - or the number of "left branches" in bracket notation - it is possible to generalize the effect of this upward register shift on the basis of syntactic constituency. More significantly, this characterization of metrical boost enables us to dispense with the conventional rule of register reset in the intonational system of Japanese.

Given this line of modeling of the syntax-downstep interaction, the evidence for the syntactic effects on downstep patterns casts doubt upon the conventional hypothesis that intonational representation involves an n-ary branching and flat structure. It speaks, instead, for a substantially different approach to intonational structure as proposed by Ladd (1986a). There may be alternative approaches to this problem (see Beckman and Pierrehumbert's
commentary in this section), but the evidence presented in this paper suggests that the new approach is at least worth exploring in more depth.
Comments on chapters 14 and 15
MARY BECKMAN and JANET PIERREHUMBERT

Chapters 14 and 15 are about downstep and its interactions with phrasal pitch range, local prominence, and the like - things that cause variation in the fundamental frequency (Fo) values that realize a particular tonal event. Sorting out this Fo variation is like investigating any other physical measure of speech; making the measure necessarily involves making assumptions about its phonetic control and its linguistic function, and since the assumptions shape the investigation, whether acknowledged or not, it is better to make them explicit.

Leaving aside for the moment assumptions about the control mechanism, we can classify the assumed linguistic function along two dimensions. The first is categorical versus continuous: the variation in the measure can function discretely to symbolize qualitatively different linguistic categories of some kind, or it can function continuously to signal variable quantities of some linguistic property in an analogue way. The second is paradigmatic versus syntagmatic: values of the measure can be freely chosen from a paradigm of contrasting independent values, or they can be relationally dependent on some other value in the context.

Describing downstep along these dimensions, we could focus on different things, depending on which aspect we are considering at which level of representation. The aspect of the phonological representation which is relevant to downstep is fundamentally paradigmatic; it is the tone string, which depicts a sequence of categorical paradigmatic choices. A syntagmatic aspect of downstep comes into focus when we look at how it is triggered; the pitch range at the location of a given tone is lowered, but the lowering depends upon the tone appearing in a particular tonal context. In Hausa, where downstep applies at any H following a HL sequence, this contextual dependence can be stated in terms of the tone string alone. In Japanese and English, the context must be specified also in terms of a structural feature - how the tones are organized into pitch accents. The syntagmatic nature of downstep comes out even more clearly in its phonetic consequences; the new value is computed for the pitch range relative to its previous value. When this relational computation applies iteratively in long sequences, the step levels resulting from the categorical input of H and L tone values tend toward a continuous scale of Fo values.
These aspects of downstep differentiate it from other components of downtrend. The phonological representation of the categorical input trigger, for example, differentiates downstep from any physiological downtrend that might be attributed to consequences of respiratory function (e.g. Maeda 1976). The phonetic representation of the syntagmatic choice of output values differentiates it also from the use of register slope to signal an interrogative or some degree of finality in a declarative, as in Thorsen's (1978, 1979) model of declination, or from the use of initial raising and final lowering of the pitch range to signal the position of the phrase in the discourse hierarchy of old and new topics, as proposed by Hirschberg and Pierrehumbert (1986). In either of these models, the declination slope or the initial or final pitch range value is an independent paradigmatic choice that is linguistically meaningful in its own right.

These definitional assumptions about its linguistic function suggest several tactics for designing experiments on downstep and its domain. First, to differentiate downstep from other components of downtrend, it is important to identify precisely the phonological trigger. This is trivial in languages such as Hausa or Japanese, where the trigger involves an obvious lexical contrast. In languages such as English and Dutch, on the other hand, identifying the downstep trigger is more difficult and requires an understanding of the intonational system as a whole. In this case, when we run into difficulties in the phonological characterization of downstep, it is often a useful tactic to wonder whether we have the best possible intonational analysis.

Thus, given the admitted awkwardness of their characterization of downstep as a phrasal feature, we might ask whether van den Berg, Gussenhoven, and Rietveld (this volume) have found the optimal tonal analysis of Dutch intonation contours. We are particularly inclined to ask the question here because of the striking resemblance to difficulties that have been encountered in some analyses of English. For example, in translating older British transcription systems into autosegmental conventions, Ladd uses H* + L for the representation of a falling Fo around the nuclear accent, precluding its use for a prenuclear H* + L which contrasts with H* phonologically primarily in its triggering of downstep. This analysis led him first (in Ladd 1983a) to a characterization of downstep in English as a paradigmatic feature of H tones, exactly analogous to van den Berg, Gussenhoven, and Rietveld's characterization of downstep in Dutch as a paradigmatic choice for the phrase, specified independently of the choice of tones or pitch accents. In subsequent work, Ladd has recognized that this characterization does not explain the syntagmatic distribution of downstepped tones which is the original motivation for the downstep analysis. He has replaced the featural analysis with the phonological representation of register relationships among pitch accents
using a recursive metrical tree, as in Liberman and Prince's (1977) account of stress patterns or Clements's (1981) account of downdrift in many African tone languages. We feel that a more satisfactory solution is to use H* + L to transcribe the prenuclear tone in many downstepped contours where the L is not realized as an actual dip in Fo before the next accent's peak. This more abstract analysis captures the phonological similarities and the common thread of pragmatic meaning that is shared by all downstepping pitch accents in English, whether they are rising (L + H) or falling (H + L) and whether the following downstepped tone is another pitch accent or is a phrasal H (see Pierrehumbert and Hirschberg 1990).

A similar problem arises in the transcription of boundary tones. Gronnum (this volume) criticizes van den Berg, Gussenhoven, and Rietveld for transcribing the contour in figure 14.1a with a L% boundary tone, on the grounds that there is no marked local lowering of Fo. But not transcribing a L% boundary tone here would mean jettisoning the generalization that all intonation phrases are marked with boundary tones - that figure 14.1a contrasts with the initial portion of contour figure 14.2a as well as with contours with a H% boundary tone, realized as a clear local rise. It would also jettison the cross-language generalization that final lowering - a progressive compression of the pitch range that reaches in from the end to lower all tones in the last half-second or so of the phrase - is typically associated with L phrase tones (see Pierrehumbert and Beckman 1988: ch. 8, for a review). Rejecting L% boundary tones in Dutch on the grounds that contours such as figure 14.1a have no more marked local fall than contours like 14.1b puts a high value on superficial similarities between the Fo representation and the tonal analysis, at the expense of system symmetry and semantic coherence. On these grounds, we endorse Gussenhoven's (1984, 1988) approach to the analysis of intonation-phrase boundary tones in Dutch, which he and his co-authors assume in their paper in this volume, and we cannot agree with Gronnum's argument against transcribing figure 14.1a with a L% boundary tone.

We think that the awkwardness of characterizing downstep in transcription systems that analyze the nuclear fall in English as a H* + L pitch accent is symptomatic of a similar confusion between phonetic representation and phonological analysis. As Ladd does for English, Gussenhoven (1988) analyzes nuclear falls in Dutch as H* + L pitch accents. Adopting this analysis, van den Berg, Gussenhoven, and Rietveld transcribe the falling patterns in figures 14.2a, 14.3a, and 14.5b all as H* + L. We wonder whether this analysis yields the right generalizations. Might the first Fo peak and subsequent sharp fall in contour 14.2a be instead a sequence of H* pitch accent followed by a L phrase accent? This alternative analysis would give an explicit intonational mark for the edge of this sort of prosodic constituent, just as in English (see
Beckman and Pierrehumbert 1986) and in keeping with the accounts of a unified prosodic structure proposed on various grounds by Selkirk (1981), Nespor and Vogel (1986), Beckman (1986), and Pierrehumbert and Beckman (1988). Might the gradually falling slope after the first peak in figure 14.3a be instead an interpolation between a H* pitch accent and the L of a following L + H* accent? This alternative analysis would attribute the contrast between absence of downstep in figure 14.3a and its presence in 14.5b to the singleton versus bitonal pitch accent, just as in similar contours in English (Beckman and Pierrehumbert 1986) and in keeping with the descriptions of other languages where a bitonal pitch accent triggers downstep, such as Tokyo Japanese (see Poser 1984; Pierrehumbert and Beckman 1988; Kubozono, this volume).

Because they analyze the falls in figures 14.2a, 14.3a, and 14.5b all as H* + L, van den Berg, Gussenhoven, and Rietveld must account for the differences among the contours instead by operations that are specified independently of the tonal representation. They attribute the step-like character of the fall in 14.5b to the operation of downstep applied as a paradigmatic feature to the whole phrase. And they account for the more gradual nature of the fall in 14.3a by the application of a rule that breaks up the H* + L accent unit to link its second tone to the syllable just before the following nuclear pitch accent. Elsewhere, Gussenhoven (1984, 1988) describes this rightward shift of the second tone as the first step of a two-part Tone Linking Rule, which in its full application would delete the L entirely to create a hat pattern. The partial and complete applications of Tone Linking merge the prenuclear accent phonologically with the nuclear accent, thus giving the sequence a greater informational integrity. Tone Linking and downstep are mutually exclusive operations.

In testing for the systemic and semantic coherence of such an intonational analysis, it is a useful strategy to determine exhaustively which patterns and contrasts are predicted to exist and which not to exist. One can then systematically determine whether the ones that are predicted to exist do contrast with each other and are interpretable in the expected way. Furthermore, one can synthetically produce the ones that are predicted not to exist and see where they are judged to be ill-formed or are reinterpreted as some phonetically similar existing pattern. Among other things, Gussenhoven's analysis of Dutch predicts that two sequences of prenuclear and nuclear accent which contrast only in whether Tone Linking applied partially or completely should not contrast categorically in the way that contours like figure 14.3a and the hat pattern do in English. Also, the description of Tone Linking implies that the operation does not apply between two prenuclear accents, so that three-accent phrases are predicted to have a smaller inventory of patterns than do two-accent phrases. That is, sequences of three accents within a single phrase should always be downstepped rather than
Tone-linked. More crucially, the analysis predicts the impossibility of three-accent phrases in which only one of the accents triggers a following downstep. As Pierrehumbert (1980) points out, the existence of such mixed cases in English precludes an analysis of downstep as an operational feature of the phrase as a whole. In general, if downstep is to be differentiated from other components of downtrend, we need to be careful of analyses that make downstep look like the paradigmatic choice of whether to apply a certain amount of final lowering to a phrase.

A second tactical point relating to the categorical phonological representation of downstep is that in looking at the influence on downstep of syntax or pragmatic focus, one needs first to know whether or not downstep has occurred, and in order to know this, it is imperative to design the corpus so that it contrasts the presence of the downstep trigger with its absence. That is, in order to claim that downstep has occurred, one cannot simply show that a following peak is lower than an earlier peak; one must demonstrate that the relationship between the two peaks is different from that between comparable peaks in an utterance of a form without the phonological configuration that triggers downstep. Kubozono reminds us of this in his paper, and it is a point well worth repeating. Using this tactic can only bring us closer to a correct understanding of the relationship between syntactic and prosodic constituents, including the domain of downstep.

A third tactical point is always to remember that other things which superficially look like downstep in the ways in which they affect pitch range do not necessarily function linguistically like downstep. For example, it is generally agreed that downstep has a domain beyond which some sort of pitch-range reset applies. Since the sorts of things that produce reset seem to be just like the things that determine stress relationships postlexically - syntactic organization and pragmatic focus and the like - it is very easy to assume that this reset will be syntagmatic in the same way that downstep is. Thus, van den Berg, Gussenhoven, and Rietveld list in their paper only these two possible characterizations:

1 The reset is a syntagmatic boost that locally undoes downstep.
2 The reset is a syntagmatic register shift by a shift factor that either (a) reverses the cumulative effects of downstep within the last domain; or (b) is an independent "phrasal downstep" parameter.

They do not consider a third characterization, suggested by Liberman and Pierrehumbert (1984) and developed in more detail by Pierrehumbert and Beckman (1988):

3 The reset is a paradigmatic choice of pitch range for the new phrase.

In this last characterization, the appearance of phrasal downstep in many
experiments would be due to the typical choice of a lower pitch range for the second phrase of the utterance, reflecting the discourse structure of the mini-paragraph.

A criticism that has been raised against our characterization in point 3 is that it introduces too many degrees of freedom. Ladd (1990), for example, has proposed that instead of independent paradigmatic choices of pitch range for each phrase and of tonal prominence for each accent, there is only the limited choice of relative pitch registers that can be represented in binary branching trees. Kubozono in his paper finds this view attractive and adapts it to the specification of pitch registers for Japanese major and minor phrases. Such a phonological characterization may seem in keeping with results of experiments such as the one that Liberman and Pierrehumbert (1984) describe, where they had subjects produce sentences with two intonation phrases that were answer and background clauses, and found a very regular relationship in the heights of the two nuclear accent peaks across ten different levels of overall vocal effort. Indeed, results such as these are so reminiscent of the preservation of stress relationships under embedding that it is easy to see why Ladd wants to attribute the invariant relationship to a syntagmatic phonological constraint on the pitch range values themselves, rather than to the constant relative pragmatic saliences. However, when we consider more closely the circumstances of such results, this criticism is called into question.

The design of Liberman and Pierrehumbert's (1984) experiment is typical in that it encouraged the subjects to zero in on a certain fixed pragmatic relationship - in that case, the relationship of an answer focus to a background focus. The constant relationship between the nuclear peak heights for these two foci may well reflect the subject's uniform strategy for realizing this constant pragmatic relationship. In order to demonstrate a syntagmatic phonological constraint, we would need to show that the peak relationships are constant even when we vary the absolute pragmatic salience of one of the focused elements.

The analogy to stress relationships also fails under closer examination in that the purely syntagmatic characterization of stress is true only in the abstract. A relational representation of a stress pattern predicts any number of surface realizations, involving many paradigmatic choices of different prominence-lending phonological and phonetic features. For example, the relatively stronger second element of red roses relative to the first might be realized by the greater prominence of a nuclear accent relative to a prenuclear accent (typical of the citation form), or by a bigger pitch range for the second of two nuclear accents (as in a particularly emphatic pronunciation that breaks the noun phrase into two intermediate phrases), or by the greater prominence of a prenuclear accent relative to no accent (as in a possible pronunciation of the sentence The florist's red roses are more expensive).
Similarly, a weak-strong pragmatic relationship for the two nouns in Anna came with Manny can be realized as a particular choice of pitch ranges for two intonational phrases, or as the relative prominence of prenuclear versus nuclear pitch accent if the speaker chooses to produce the sentence as one intonation phrase. As Jackendoff (1972), Carlson (1983), and others have pointed out, the utterance in this case has a somewhat different pragmatic interpretation due to Anna's not being a focus, although Anna still is less salient pragmatically than Manny.

The possibility of producing either two-foci or one-focus renditions of this sentence raises an important strategic issue. Liberman and Pierrehumbert (1984) elicited two-foci productions by constructing a suitable context frame and by pointing out the precise pragmatic interpretation while demonstrating the desired intonation pattern. If they had not taken care to do this, some of their subjects might have given the other interpretation and produced the other intonation for this sentence, making impossible the desired comparison of the two nuclear-accent peaks. A more typical method in experiments on pitch range, however, is to present the subject with a randomized list of sentences to read without providing explicit cues to the desired pragmatic and intonational interpretation. In this case, the subject will surely invent an appropriate pragmatic context, which may vary from experiment to experiment or from utterance to utterance in uncontrolled ways. The effect of this uncontrolled variation is to have an uncontrolled influence on the phrasal pitch ranges and on the prominences of pitch accents within a pitch range. The variability of results concerning the interaction of syntax and downstep in the literature on Japanese (e.g., Kubozono 1989, this volume; Selkirk 1990; Selkirk and Tateishi, 1990) may reflect this lack of control more than it does anything about the interaction between syntactic constituency and downstep.

The fact that a sentence can have more than one pragmatic interpretation also raises a methodological point about statistics: before we can use averages to summarize data, we need to be sure that the samples over which we are averaging are homogeneous. For example, both in Poser (1984) and in Pierrehumbert and Beckman (1988), there were experimental results which could be interpreted as showing that pragmatic focus reduces but does not block downstep. When we looked at our own data more closely, however, we found that the apparent lesser downstep was actually the result of a single outlier in which the phrasing was somewhat different and downstep had occurred. Including this token in the average made it appear as if the downstep factor could be chosen paradigmatically in order to give greater pitch height than normal to prosodic constituents with narrow focus. The unaveraged data, however, showed that the interaction is less direct; elements bearing narrow focus tend to be prosodically separated from preceding
elements and thus are realized in pitch ranges that have not been downstepped relative to the pitch range of preceding material. It is possible that Kubozono could resolve some of the apparent contradictions among his present results and those of other experiments in the literature on Japanese if he could find appropriate ways of looking at all of the data token by token.

The specific tactical lesson to draw here is that since our understanding of pragmatic structure and its relationship to phrasing and tone-scaling is not as well developed as our understanding of phonological structure and its interpretation in Fo variation, we need to be very cautious about claiming from averaged data that downstep occurs to a greater or lesser degree in some syntactic or rhythmic context. A more general tactical lesson is that we need to be very ingenious in designing our experiments so as to elicit productions from our subjects that control all of the relevant parameters. A major strategic lesson is that we cannot afford to ignore the knotty questions of semantic and pragmatic representation that are now puzzling linguists who work in those areas. Indeed, it is possible that any knowledge concerning prosodic structure and prominence that we can bring to these questions may advance the investigative endeavor in previously unimagined ways.

Returning now to assumptions about control mechanism, there is another topic touched on in the paper by van den Berg, Gussenhoven, and Rietveld which also raises a strategic issue of major importance to future progress in our understanding of tone and intonation. This is the question of how to model L tones and the bottom of the pitch range. Modeling the behavior of tones in the upper part of the pitch range - the H tones - is one of the success stories of laboratory phonology. The continuing controversies about many details of our understanding (evident in the two papers in this section) should not be allowed to obscure the broad successes. These include the fact that [+H] is the best understood distinctive-feature value. While work in speech acoustics has made stunning progress in relating segmental distinctive features to dimensions of articulatory control and acoustic variation, the exact values along these dimensions which a segment will assume in any given context in running speech have not been very accurately modeled. In contrast, a number of different approaches to H-tone scaling have given rise to Fo synthesis programs which can generate quite accurately the contours found in natural speech. Also, work on H-tone scaling has greatly clarified the division of labor between phonology and phonetics. In general, it has indicated that surface phonological representations are more abstract than was previously supposed, and that much of the burden of describing sound patterns falls on phonetic implementation rules, which relate surface phonological representations to the physical descriptions of speech. Moreover, attempts to formulate such rules from the results of appropriately designed experiments have yielded insights into the role of prosodic structure in speech
production. They have provided additional support for hierarchical structures in phonology, which now stand supported from both the phonetic and morphological sides, a fate we might wish on more aspects of phonological theory.

In view of these successes, it is tempting to tackle L tones with the same method that worked so well for H tones - namely, algebraic modeling of the Fo values measured in controlled contexts. Questions suggested under this approach include: What functions describe the effects of overall pitch range and local prominence on Fo targets for L tones? What prevents L tones from assuming values lower than the baseline? In downstep situations, is the behavior of L tones tied to that of H tones, and if so, by what function? We think it is important to examine the assumptions that underlie these questions, particularly the assumptions about control mechanisms. We suggest that it would be a strategic error to apply too narrowly the precedents of work on H-tone scaling.

Looking first at the physiological control, we see that L-tone scaling is different from H-tone scaling. A single dominant mechanism, cricothyroid contraction, appears to be responsible for H tone production, in the sense that this is the main muscle showing activity when Fo rises into a H tone. In contrast, no dominant mechanism for L tone production has been found. Possible mechanisms include:

Cricothyroid relaxation - e.g., Simada and Hirose (1971), looking at the production of the initial boundary L in Tokyo Japanese; Sagart et al. (1986), looking at the fourth (falling) tone in Mandarin.
Reduction of subglottal pressure - e.g., Monsen, Engebretson, and Vermula (1978), comparing L and H boundary tones.
Strap muscle contraction - e.g., Erickson (1976), looking at L tones in Thai; Sagart et al. (1986), looking at the third (low) tone in Mandarin; Sugito and Hirose (1978), looking at the initial L in L-initial words and the accent L in Osaka Japanese; Simada and Hirose (1971) and Sawashima, Kakita, and Hiki (1973), looking at the accent L in Tokyo Japanese.
Cricopharyngeus contraction - Honda and Fujimura (1989), looking at L phrase accents in English.
Some of these mechanisms involve active contraction, whereas others involve passive relaxation. There is some evidence that the active gesture of strap muscle contraction comes into play only for L tones produced very low in the pitch range. For example, of the four Mandarin tones, only the L of the third tone seems to show sternohyoid activity consistently (see Sagart et al. 1986). Similarly, the first syllable of L-initial words in Osaka Japanese shows a marked sternohyoid activity (see Sugito and Hirose 1978) that is not usually observed in the higher L boundary tone at the beginning of Tokyo Japanese
accentual phrases (see, e.g., Simada and Hirose 1971). Lacking systematic work on the relation of the different mechanisms to different linguistic categories, we must entertain the possibility that no single function controls L-tone scaling. Transitions from L to H tones may bring in several mechanisms in sequence, as suggested in Pierrehumbert and Beckman (1988). One of the tactical imports of the different mechanisms is that we need to be more aware of the physiological constraints on transition shape between tones; we should not simply adopt the most convenient mathematical functions that served us so well in H-tone scaling models.

Another common assumption that we must question concerns the functional control of the bottom of the pitch range. We need to ask afresh the question: Is there a baseline? Does the lowest measured value at the end of an utterance really reflect a constant floor for the speaker, which controls the scaling of tones above it, and beyond which the speaker does not aim to produce nor the hearer perceive any L tone? Tone-scaling models have parlayed a great deal from assuming a baseline, on the basis of the common observation that utterance-final L values are stable for each speaker, regardless of pitch range. On the other hand, it is not clear how to reconcile this assumption with the observation that nuclear L tones in English go up with voice level (see e.g. Pierrehumbert, 1989). This anomaly is perturbing because it is crucial that we have accurate measures of the L tones; estimates of the baseline from H-tone scaling are quite unstable in the sense that different assumptions about the effective floor can yield equally good model fits to H-tone data alone. The assumption that the bottom of the pitch range is controlled via a fixed baseline comes under further suspicion when we consider that the last measured Fo value can be at different places in the phrasal contour, depending on whether and where the speaker breaks into vocal fry or some other aperiodic mode of vocal-fold vibration. It is very possible that the region past this point is intended as, and perceived as, lower than the last point where Fo measurement is possible.

A third assumption that relates to both the physiological and the functional control of L tones concerns the nature of overall pitch-range variation. It has been easy to assume in H-tone modeling that this variation essentially involves the control of Fo. Patterns at the top of the range have proved remarkably stable at the different levels of overall Fo obtained in experiments, allowing the phenomenon to be described with only one or two model parameters. We note, however, that the different Fo levels typically are elicited by instructing the subject to "speak up" to varying degrees. This is really a variation of overall voice effort, involving both an increased subglottal pressure and a more pressed vocal-fold configuration. It seems likely, therefore, that the actual control strategy is more complicated than our
H-tone models make it. While the strategy for controlling overall pitch range interacts with the physiological control of H tones in apparently simple ways, its interaction with the possibly less uniform control mechanism for L tones may yield more complicated Fo patterns. In order to find the invariants in this interaction, we will probably have to obtain other acoustic measures besides Fo to examine the other physiological correlates of increased pitch range besides the increased rate of vocal-fold vibration. Also, it may be that pitch-range variation is not as uniform functionally as the H-tone results suggest. It is possible that somewhat different instructions to the subject or somewhat different pragmatic contexts will emphasize other aspects of the control strategy, yielding different consequences for Fo variation, particularly at the bottom of the pitch range.

These questions about L-tone scaling have a broader implication for research strategy. Work on H tones has brought home to us several important strategic lessons: experimental designs should orthogonally vary local and phrasal properties; productions should be properly analyzed phonologically; and data analysis should seek parallel patterns within data separated by speaker. We clearly need to apply these lessons in collecting Fo measurements for L tones. However, to understand fully L tones, we will need something more. We will need more work relating linguistic to articulatory parameters. We will need to do physiological experiments in which we fully control the phonological structure of utterances we elicit, and we will need to develop acoustic measures that will help to segregate the articulatory dimensions in large numbers of utterances.
16
Secondary stress: evidence from Modern Greek
AMALIA ARVANITI
16.1 Introduction
The need to express formally stress subordination in English has always been felt and many attempts to do so have been made, e.g. Trager and Smith (1951), Chomsky and Halle (1968). However, until the advent of metrical phonology (Liberman and Prince 1977) all models tried to express stress subordination through linear analyses. The great advantage of metrical phonology is that by presenting stress subordination through a hierarchical structure it captures the difference in stress values between successive stresses in an economical and efficient way.

When Liberman and Prince presented their model, one of their intentions was to put forward a "formalization of the traditional idea of 'stress timing'" (1977: 250) through the use of the metrical grid. This reference to stress-timing implies that their analysis mainly referred to the rhythm of English. However, the principles of metrical phonology have been adopted for the rhythmic description of other languages (Hayes 1981; Hayes and Puppel 1985; Roca 1986), including Polish and Spanish, which are rhythmically different from English. The assumption behind studies like Hayes (1981) is that, by showing that many languages follow the same rhythmic principles as English, it can be proved that the principles of metrical phonology, namely binarity of rhythmic patterns and, by consequence, hierarchical structure, are universal. However, such evidence cannot adequately prove the universality of metrical principles; what is needed is evidence that there are no languages which do not conform to these principles. Thus, it would be interesting to study a language that does not seem to exhibit a strictly hierarchical, binary rhythmic structure. If the study of such a language proves this to be the case, then the claim for the universality of binary rhythm may have to be revised. One language that seems to show a markedly different kind of rhythmic patterning from English is Modern Greek.
In fact, the past decade has seen the appearance of a number of studies of Modern Greek prosody both in phonology (Malikouti-Drachman and Drachman 1980; Nespor and Vogel 1986, 1989; Berendsen 1986) and in phonetics (Dauer 1980; Fourakis 1986; Botinis 1989). These studies show substantial disagreement concerning the existence and role of secondary stress in Greek.
By way of introduction I present a few essential and undisputed facts about Greek stress. First, in Greek, lexical stress conforms to a Stress Well-formedness Condition (henceforth SWFC), which allows lexical stress on any one of the last three syllables of a word but no further to the left (Joseph and Warburton 1987; Malikouti-Drachman and Drachman 1980). Because of the SWFC, lexical stress moves one syllable to the right of its original position when affixation results in the stress being more than three syllables from the end of the word; e.g.
(1) /'maθima/ "lesson" > /'maθima + ta/ "lesson + s" > /ma'θimata/ "lessons"
Second, as can be seen from example (1), lexical stress placement may depend on morphological factors, but it cannot be predicted by a word's metrical structure because there are no phonological weight distinctions either among the Greek vowels, /i, e, a, o, u/, or among syllables of different structure; i.e. in Greek, all syllables are of equal phonological weight. Therefore, it is quite common for Greek words with the same segmental structure to have stress on different syllables; e.g.
(2) a. /'xo.ros/ "space"
    b. /xo.'ros/ "dance" (noun)
It is equally possible to find words like those in (3),
(3) a. /'pli.θos/ "crowd"
    b. /'plin.θos/ "brick"
where both words are stressed on their first syllable, although this is open in (3a) and closed in (3b). Finally, when the SWFC is violated by the addition of an enclitic to a host stressed on the antepenultimate, a stress is added two syllables to the right of the lexical stress. For example,
(4) /'maθima tu/ > /'maθi'ma tu/ "his lesson"
(5) /'ðose mu to/ > /'ðose 'mu to/ "give it to me"
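The stress facts in (1)-(5) can also be stated procedurally. The following Python sketch is purely illustrative and is not part of the original paper: the function name and the syllabified inputs are invented for the example. It derives the surface stress pattern of a host-and-clitic group from the host's lexical stress and its enclitics, adding a stress two syllables to the right of the lexical stress whenever the enclitics leave that stress more than three syllables from the end of the group.

```python
# Illustrative sketch (not from the original paper) of the SWFC as described above:
# lexical stress must fall on one of the last three syllables; enclitics are added
# postlexically, and when they push the lexical stress further left than the
# antepenult of the host-and-clitic group, a stress is added two syllables to its right.

def add_enclitics(host_syllables, stressed_index, enclitic_syllables):
    """host_syllables: list of syllables; stressed_index: 0-based position of the
    host's lexical stress; enclitic_syllables: list of enclitic syllables.
    Returns the group with stressed syllables marked by a preceding apostrophe."""
    group = host_syllables + enclitic_syllables
    stresses = {stressed_index}
    # SWFC violated if the lexical stress is more than three syllables from the end
    if len(group) - stressed_index > 3:
        stresses.add(stressed_index + 2)   # added stress two syllables to the right
    return " ".join(("'" if i in stresses else "") + syl for i, syl in enumerate(group))

print(add_enclitics(["ma", "θi", "ma"], 0, ["tu"]))     # 'ma θi 'ma tu  "his lesson"
print(add_enclitics(["ðo", "se"], 0, ["mu", "to"]))     # 'ðo se 'mu to  "give it to me"
print(add_enclitics(["pa", "te", "ra"], 1, ["mu"]))     # pa 'te ra mu   (no violation)
```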
All investigators (Setatos 1974; Malikouti-Drachman and Drachman 1980; Joseph and Warburton 1987; Botinis 1989) accept that the two stresses in the host-and-clitic group have different prominence values. However, not all of them agree as to the relative prominence of the two stresses. Most
investigators (Dauer 1980; Malikouti-Drachman and Drachman 1980; Joseph and Warburton 1987) agree that the added stress is stronger than the host's lexical stress. Setatos (1974), however, followed by Nespor and Vogel (1986, 1989), claims that the host's lexical stress remains the strongest. Thus, a first point of disagreement emerges: namely, the prominence value of the SWFC-induced stress.
Botinis (1989) presents an entirely different analysis: influenced perhaps by work on Swedish prosody, he claims that the SWFC-induced stress of a host-and-clitic group is a "phrase stress." Botinis admits that "on acoustic grounds it is questionable if there is enough evidence to differentiate between word and phrase stress although they have quite different perceptual dimensions" (1989: 85). However, Botinis's claim that word and phrase stress are perceptually distinct could be attributed partly to incorrect manipulation of the Fo contour in the synthesized stimuli of his perceptual experiment and partly to incorrect interpretation of this experiment's results (for details see Arvaniti 1990). In fact, Botinis's evidence suggests that the SWFC-induced stress is acoustically the most prominent in the host-and-clitic group.
Most of the studies (Dauer 1980; Fourakis 1986; Joseph and Warburton 1986) mention stress subordination only in relation to the host-and-clitic group stress addition, in which case they refer to "secondary stress." In other words, in most of the studies it is assumed that in Greek each word normally carries only lexical stress. Phonological studies (Malikouti-Drachman and Drachman 1980; Nespor and Vogel 1986, 1989), though, assume that, in addition to lexical stress, Greek exhibits rhythmic stresses which are added at the surface level. Thus, the presence of rhythmic stresses is the second point of contention among studies. Nespor and Vogel and Malikouti-Drachman and Drachman relate rhythmic stress to the "secondary" stress of host-and-clitic groups but in two different ways. Nespor and Vogel, on the one hand, propose that rhythm is represented by the grid which is built "on the basis of the prosodic structure of a given string" (1989: 70). The grid shows prominence relations among stresses but it cannot show constituency. Nespor and Vogel (1989) suggest that in Greek rhythmic stresses appear only when there is a lapse in the first level of the grid; in other words, whereas a series of unstressed syllables constitutes a lapse that can trigger rhythmic stress, a series of lexical stresses of equal prominence does not constitute a lapse. When a lapse occurs, one of the syllables that has only one asterisk in the grid acquires a second asterisk, i.e. a rhythmic stress, through the beat addition rule. The placement of rhythmic stresses is regulated by two preference rules. As has been mentioned, Nespor and Vogel (1986, 1989) also maintain that the SWFC-induced stress (or "secondary stress," as they call it) is less prominent than the original lexical stress of the host; this "secondary stress" is of equal prominence to a
rhythmic stress. According to Nespor and Vogel (1989) the difference between a SWFC-induced stress and a rhythmic stress lies in the fact that the former is the result of an obligatory prosodic rule which operates within C (clitic group) while the latter is the result of Beat Addition, an optional rhythmic rule which operates in the grid. Examples from Nespor (forthcoming) and Nespor and Vogel (1989) suggest that rhythmic stress and "secondary" stress have the same phonetic realization since (a) they are both represented by two asterisks in the grid and (b) the grid cannot show constituency differences (i.e. the fact that "secondary stress" belongs to C). However, if this were correct then the following examples
(6) [o 'ðaska.los tu]c1 ['anikse tin 'porta]c "His teacher opened the door"
(7) [o 'ðaskalos]c [tu 'anikse tin 'porta]c "The teacher opened the door to him"
could have the same rhythmic structure since the lapse in (7), i.e. the series of unstressed syllables /skalos tu/, can only be remedied by adding a rhythmic stress on /los/. This is not the case, however; the two examples are clearly differentiated in Greek. Indeed, the subjects who took part in the perceptual experiments reported in Botinis (1989) could distinguish even synthesized stimuli of similar structures to (6) and (7) in at least 91 percent of the cases. Thus, one of the two claims of Nespor and Vogel (1989) must be incorrect: if /los/ in (6) carries "secondary" stress then "secondary" and rhythmic stress must be presented in different ways in the grid, or else /los/ in (6) carries the main stress of the host-and-clitic group, not a "secondary" one. As has been mentioned, Malikouti-Drachman and Drachman (1980) also relate secondary and rhythmic stresses, but their approach differs from that of Nespor and Vogel, in that the former assume that the secondary stress in a host-and-clitic group is the weakened lexical stress of the host. Rhythmic stresses are added following the Rhythm Rule which states "Make a trochaic foot of any adjacent pair of weak syllables to the left of the lexical stress within the word [word + clitics] (iterative)" (1980: 284). (8)
/i aðer'fi mu/ "my sister" [metrical trees not reproduced: the Rhythm Rule groups the weak syllables to the left of the lexical stress into a trochaic foot]
1 The stresses on /ðaskalos/ in (6) are presented here with the prominence values assumed by Nespor and Vogel (1989).
(9) /i 'epavli mas/ "our villa" [metrical trees not reproduced: refooting to the right of the lexical stress yields the SWFC-induced stress on /li/]
As can be seen from examples (8) and (9), the Rhythm Rule applies not only to the left but also to the right of the lexical stress; this "refooting" explains the SWFC-induced stress. However, by equating the metrical structures of (8) and (9), this analysis cannot differentiate between a metrical tree with an optional rhythm-induced stress like (8), and a tree with an obligatory SWFC-induced stress like (9).
To summarize, there seem to be two main interconnected issues that are addressed by researchers: namely, the presence and nature of rhythmic and SWFC-induced stress. In brief, only Nespor and Vogel (1986, 1989) and Malikouti-Drachman and Drachman (1980) agree that Greek exhibits rhythmic stress; and although they disagree as to which of the two stresses in a host-and-clitic group is the most prominent, they agree that the weaker of the two is identical to rhythmic stress. Botinis (1989), on the other hand, does not mention rhythmic stresses, but he proposes two distinct prosodic categories, i.e. word and phrase stress, to account for the SWFC-induced stress in host-and-clitic groups.
The present paper is an attempt to examine these issues using acoustical and perceptual evidence rather than impressionistic data. First, two questions must be answered: (a) whether the stress added to a host-and-clitic group due to SWFC violation is the most prominent in the group; (b) whether this added stress is perceptually distinct from a lexical stress. When answers to these questions are established two more questions can be addressed: (c) whether "secondary" stress and rhythmic stress are perceptually and acoustically the same or not; (d) whether there is any evidence for rhythmic stress. The questions are investigated by means of two perceptual tests, and acoustic analyses of the data of the second experiment. Finally, an attempt is made to present the results formally within a broadly conceived metrical framework.
16.2 Experiment 1
16.2.1 Method
16.2.1.1 Material
The first experiment is a simple perceptual test whose aim is to see first whether the lexical stress of the host-and-clitic group is more prominent than
the SWFC-induced one, as Setatos (1974) and Nespor and Vogel (1986, 1989) maintain, and second whether Botinis's phrase stress and word stress are two perceptually distinct stress categories as his analysis suggests.

Table 16.1 One of the two test pairs (1a and 1b) and one of the distractors (2a and 2b) in the context in which they were read. The test phrases and distractors are in bold type.
1 (a) /tu 'ipa to yi'a arista su ke 'xarike po'li/ "I told him about your 1st class mark and he was very pleased"
  (b) /e'yo tu fo'nazo 'ari stasu ki a'ftos ðe stama'tai/ "I shout at him Ari stop but he doesn't stop"
2 (a) /pi'stevo 'oti 'ksero to 'mono 'loyo yi'a a'fti tin ka'tastasi/ "I believe that I know the only reason for this situation"
  (b) /den 'exo a'kusi pi'o vare'to mo'noloyo sto 'θeatro/ "I haven't listened to a more boring theatrical monologue"

Two test pairs were designed (see the parts of table 16.1, 1a and 1b, in bold type): in each test pair the two members are segmentally identical but have word boundaries at different places and are orthographically distinct in Greek. The first member, (a), of each pair consists of one word stressed on the antepenultimate and followed by an enclitic possessive pronoun. As this pattern violates the SWFC, a stress is added on the last syllable of the word. The second member, (b), consists of two words which together form a phrase and which are stressed on the same syllables as member (a). Thus, the difference between members (a) and (b) of each test pair is that in (a) the phrase contains a lexical and a SWFC-induced stress whereas in (b) each of the two words carries lexical stress on the same syllables as (a). According to Nespor and Vogel, the most prominent stress in (a) phrases is the lexical stress of the host, while in (b) phrases it is the stress of the second word (i.e. the one that falls on the same syllable as the SWFC-induced stress in (a)), since the second word is the head of the phonological phrase φ (1986: 168). Also, in Botinis's terms (a) and (b) phrases have different stress patterns, (a) containing one word and one phrase stress and (b) containing two word stresses; these stress patterns are said by Botinis to be perceptually distinct. If either Nespor and Vogel or Botinis is correct, (a) and (b) phrases should be distinguishable.
The test phrases were incorporated into meaningful sentences (see table 16.1). Care was taken to avoid stress clashes, and to design, for each pair, sentences of similar prosodic structure and length. Two distractor pairs were devised on the same principle as the test pairs (see table 16.1, 2a and 2b). The difference is that in the distractors one member contains two words, each one
with its own lexical stress (/'mono 'loyo/ "only reason"), while in the other member the same sequence of syllables makes one word with lexical stress on a different syllable from those stressed in the first member (/mo'noloyo/ "monologue").
The sentences were read by four subjects including the author. Each subject read the test sentences and the distractors six times from a randomized list, typed in Greek. The recorded sentences and the distractors were digitized at 16 kHz and then were edited so that only the test phrases and distractors were left. For each test phrase and distractor one token from each one of the four subjects was selected for the test tape. The tokens chosen were, according to the author's judgment, those that sounded most natural by showing minimum coarticulatory interference from the carrier phrase. To make the listening tape, the test phrases and the distractors were recorded at a sampling rate of 16 kHz using computer-generated randomization by blocks so that each token from each subject was heard twice. Each test phrase and distractor was preceded by a warning tone. There were 100 msec. of silence between the tone and the following phrase and 2 sec. between each stimulus and the following tone. Every twenty stimuli there was a 5 sec. pause. In order for listeners to familiarize themselves with the task, the first four stimuli were repeated at the end of the tape, and the first four responses of each listener were discarded. Thus, each subject heard a total of seventy stimuli: 4 speakers x (4 test phrases + 4 distractors) x 2 blocks + 4 repeated items + 2 stimuli that consisted of two tones each (a result of the randomization program).
16.2.1.2 Subjects
As mentioned, four subjects took part in the recording. Three of them (two female, one male) were in their twenties and they were postgraduate students at the University of Cambridge. The fourth subject was a sixty-year-old woman visiting Cambridge. All subjects were native speakers of Greek and spoke the standard dialect. All, apart from the fourth subject, had extensive knowledge of English. None of the subjects had any history of speech or hearing problems. Apart from the author all subjects were naive as to the purpose of the experiment.
Eighteen subjects (seven male and eleven female) did the perceptual test. They were all native speakers of Greek and had no history of speech or hearing problems. Twelve of them were between 25 and 40 years old and the other six were between 40 and 60 years old. Fourteen of them spoke other languages in addition to Greek but only one had extensive knowledge of and contact with a foreign language (English). All subjects had at least secondary education and fourteen of them held university degrees. All subjects spoke
Standard Greek, as spoken in Athens, where sixteen of them live. They were all naive as to the purposes of the experiment.
16.2.1.3 Procedure
The subjects did the test in fairly quiet conditions using headphones and a portable Sony Stereo Cassette-Corder TCS-450. No subject complained that their performance might have been marred by noise or poor-quality equipment. The subjects were given a response sheet, typed in Greek, which gave both possible interpretations of every stimulus in the tape (70 x 2 possible answers). The task was explained to them and they were urged to give an answer to all stimuli even if they were not absolutely certain of their answer. The subjects were not allowed to play back the tape.
16.2.2 Results
The subjects gave a total of 576 responses excluding the distractors (18 subjects x 32 test phrases/answer sheet). There were 290 mistakes, i.e. 50.34 percent of the responses to the test phrases were wrong (identification rate 49.66 percent). The number of mistakes ranged from a minimum of nine (one subject) to a maximum of twenty-one (one subject). By contrast, the identification rate of the distractors was 99.1 percent; out of 18 subjects only two made one and four mistakes respectively. Most subjects admitted that they could not tell the test phrases apart although they found the distractors easy to distinguish. Even the subjects who insisted that they could tell apart the test pairs made as many mistakes as the rest.
Thus the results of experiment 1 give an answer to the first two questions addressed here. The results clearly indicate (see table 16.2) that, contrary to Setatos (1974) and Nespor and Vogel (1986, 1989), the SWFC-induced stress is the most prominent in the host-and-clitic group, whereas the original lexical stress of the host weakens. This weakening is similar to that of the lexical stress of a word which is part of a bigger prosodic constituent, such as a φ, without being its head. Also, the results show that in natural speech Botinis's "phrase stress" is not perceptually distinct from word stress as he suggests.
16.3 Experiment 2
16.3.1 Method
16.3.1.1 Material
Experiment 2 includes a perceptual test and acoustical analyses of the utterances used for it. With the answers to questions (a) and (b) established, this experiment aims at answering the third question addressed here: namely, whether rhythmic stress and the weakened lexical stress
(or "secondary stress") of the host in a host-and-clitic group are perceptually and acoustically the same, as Malikouti-Drachman and Drachman suggest. In addition, this experiment is an attempt to find acoustical evidence for rhythmic stress.

Table 16.2 Experiment 1: contingency table of type of stimulus by subject response.

Response           "1 word" stimulus    "2 word" stimulus    Total
(a) Observed responses
"1 word"                 113                  115             228
"2 word"                 175                  173             348
Total                    288                  288             576
(b) Expected responses (and deviances)
"1 word"                 114 (0.008)          114 (0.008)
"2 word"                 174 (0.005)          174 (0.005)
Note: Total deviance (χ²) = 0.026, 1 df. The difference between the relevant conditions is not significant.
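The deviance values reported in these contingency tables follow directly from the observed counts. The Python sketch below is a hypothetical illustration, not the procedure used in the original study; note that because the per-cell deviances above are rounded to three decimals, the unrounded total for table 16.2 comes out at roughly 0.029 rather than the printed 0.026, whereas for table 16.4 (given later, in section 16.3.2) it matches the printed 727.1.

```python
# Sketch: total deviance (chi-square) for a 2 x 2 contingency table.
# Hypothetical illustration only; rows are responses, columns are stimulus types.

def chi_square_2x2(observed):
    """observed[i][j]: count of response i given to stimulus type j."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    deviance = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            deviance += (obs - expected) ** 2 / expected
    return deviance

table_16_2 = [[113, 115],   # "1 word" responses to "1 word" / "2 word" stimuli
              [175, 173]]   # "2 word" responses
print(chi_square_2x2(table_16_2))   # about 0.029 (0.026 in the table after per-cell rounding)

table_16_4 = [[402, 17],    # SS responses to SS / RS stimuli
              [6, 391]]     # RS responses
print(chi_square_2x2(table_16_4))   # about 727.1, matching the printed total
```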
stress") of the host in a host-and-clitic group are perceptually and acoustically the same as Malikouti-Drachman and Drachman suggest. In addition, this experiment is an attempt to find acoustical evidence for rhythmic stress. Four pairs of segmentally identical words with different spelling and stress patterns were chosen (see the parts of table 16.3 in bold type). One word in each pair has lexical stress on the antepenultimate - (la) in table 16.3 - and the other on the last syllable - (lb) in table 16.3. These pairs were incorporated into segmentally identical, but orthographically distinct sentences in which they were followed by a possessive enclitic (see table 16.3). For clarity, (a) test words will be referred to as SS (for "secondary stress") and (b) words as RS (for rhythmic stress), thus reflecting the terms used by various analyses but not necessarily the author's opinion on the nature of stress in Greek. These two terms will be used throughout with the same caution. The addition of the enclitic results in a change in the stress pattern of SS words as required by the SWFC. Thus, these words have a "secondary stress" on their antepenultimate syllable (i.e. the weakened lexical stress) and primary stress (i.e. the added stress) on their last syllable. According to Malikouti-Drachman and Drachman, RS words also have this stress pattern since (a) polysyllabic words with final stress carry rhythmic stress on their antepenultimate syllable and (b) rhythmic stress and "secondary stress" are not distinguished. If their claims are correct, then the SS and RS words of each test pair are segmentally and metrically identical and therefore indistinguishable. Four pairs of distractors incorporated into identical sentences were also included (see table 16.3, 2a and 2b). These were devised on the same pattern 406
as the test sentences, the difference being that the word pairs in the distractors differed in the position of the primary stress only: /a'poxi/ "hunting-net" : /apo'xi/ "abstention".

Table 16.3 One of the test sentence pairs and one of the distractor sentence pairs of experiment 2. The test words and distractors are in bold type.
1 (a) /mu 'eleye 'oti 'vriski ton .eni'ko tis po'li enoxliti'ko/ "S/he was telling me that s/he finds her tenant very annoying"
  (b) /mu 'eleye 'oti 'vriski ton eni'ko tis po'li enoxliti'ko/ "S/he was telling me that s/he finds her "singular" [nonuse of politeness forms] very annoying"
2 (a) /i a'poxi tus 'itan po'li me'yali/ "Their hunting-net was very big"
  (b) /i apo'xi tus 'itan po'li me'yali/ "Their abstention was very big"

The sentences were read by six subjects, in conditions similar to those described for experiment 1, from a typed randomized list which included six repetitions of each test sentence and distractor. The first two subjects (El and MK) read the sentences from hand-written cards three times each. A listening tape was made in the same way as in experiment 1. The stimuli were the whole sentences, not just the test word pairs. The tape contained one token of each test sentence and distractor elicited from each subject. There were 3 sec. of silence between sentences and 5 sec. after every tenth sentence. The first four sentences were repeated at the end of the tape, and the first four responses of each listener were discarded. Each subject heard a total of 100 sentences: 6 speakers x (8 test sentences + 8 distractors) + 4 repeated stimuli.
16.3.1.2 Perceptual experiment
The same speakers that recorded the material of experiment 1 did the recording of experiment 2. In addition, two more female subjects (El and MK) of similar age and education as three of the subjects of experiment 1 took part in the recording. All the subjects that took part in perceptual test 1 performed test 2 as well. The responses of one of the subjects, who did not understand what she was asked to do and left most test pairs unmarked, were discarded. The procedure was the same as that described in experiment 1. The answer sheet gave 200 possible answers, i.e. 100 stimuli x 2 alternatives.
16.3.1.3 Acoustical analyses
All three tokens of each test sentence of the original recording of El and
MK, and the first three tokens of HP's recording, were digitized at a sampling rate of 16 kHz and measurements of duration, amplitude, and Fo were obtained. Comparisons of the antepenultimate and final syllables of the SS words with the equivalent syllables of the RS words are presented here (see figures 16.1-16.4, below). For instance, the duration, Fo, and amplitude values of /e/ in SS word /.eni'ko/ "tenant" were compared to those of /e/ in RS word /eni'ko/ "singular."
Duration was measured from spectrograms. The error range was one pitch period (about 4-5 msec., as all three subjects were female). Measurements followed common criteria of segmentation (see Peterson and Lehiste 1960). VOT was measured as part of the following vowel.
Three different measurements of amplitude were obtained: peak amplitude (PA), root mean square (RMS) amplitude, and amplitude integral (AI). All data have been normalized so as to avoid statistical artifacts due to accidental changes such as a subject's leaning towards the microphone, etc. To achieve normalization, the PA of each syllable was divided by the highest PA in the word in question, while the RMS and AI of each syllable were divided by the word's RMS and AI respectively; thus the RMS and AI of each syllable are presented as percentages of the word's RMS and AI respectively. All results refer to the normalized data. All original measurements were in arbitrary units given by the signal processing package used. For peak amplitude, measurements were made from waveforms at the point of highest amplitude of each syllable nucleus. RMS and AI were measured using a computer program which made use of the amplitude information available in the original sample files.2 To calculate the RMS, the amplitude of each point within the range representing the syllable nucleus was squared and the sum of squared amplitudes was divided by the number of points; the square root of this measurement represents the average amplitude of the sound (RMS) and is independent of the sound's duration. AI measurements were obtained by simply calculating the square root of the sum of squared amplitudes of all designated points without dividing the sum by the number of points. In this way, the duration of the sound is taken into account when its amplitude is measured, as a longer sound of lower amplitude can have the same amplitude integral as a shorter sound of higher amplitude. This way of measuring amplitude is based on Beckman (1986); Beckman, indeed, found that there is a strong correlation between stress and AI for English.
2 I am indebted to Dr. D. Davies and Dr. K. Roussopoulos for writing the program for me.
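For concreteness, the three amplitude measures and the word-based normalization just described can be written out as follows. This Python sketch is a hypothetical reconstruction, not the program used in the study, and it assumes that "the word's RMS and AI" are computed over all the sample points of the word.

```python
import math

# Sketch of the amplitude measures described above, applied to syllable nuclei
# given as lists of waveform sample values (arbitrary units).
# Hypothetical reconstruction; not the original analysis program.

def peak_amplitude(samples):
    # PA: the largest absolute sample value in the nucleus
    return max(abs(s) for s in samples)

def rms_amplitude(samples):
    # RMS: square each point, average over the number of points, take the root;
    # independent of the sound's duration
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def amplitude_integral(samples):
    # AI: root of the sum of squared amplitudes, without dividing by the number
    # of points, so duration contributes to the measure (cf. Beckman 1986)
    return math.sqrt(sum(s * s for s in samples))

def normalize(word_syllables):
    """word_syllables: dict mapping syllable label -> list of samples for that nucleus.
    Returns, per syllable, PA relative to the word's highest PA, and RMS and AI
    as percentages of the word's RMS and AI, as described in the text
    (assuming the word's RMS/AI are taken over all of the word's samples)."""
    pa = {syl: peak_amplitude(s) for syl, s in word_syllables.items()}
    max_pa = max(pa.values())
    all_samples = [x for s in word_syllables.values() for x in s]
    word_rms = rms_amplitude(all_samples)
    word_ai = amplitude_integral(all_samples)
    return {syl: {"PA": pa[syl] / max_pa,
                  "RMS%": 100 * rms_amplitude(s) / word_rms,
                  "AI%": 100 * amplitude_integral(s) / word_ai}
            for syl, s in word_syllables.items()}
```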
Fundamental frequency was measured using the Fo tracker facility of a signal processing package (Audlab). To ensure the reliability of the Fo tracks, narrow-band spectrograms were also made and the contour of the harmonics tracked and measured. Discontinuities in the Fo tracks were smoothed out by hand to correspond to the contour of the harmonics in the narrow-band spectrograms. No actual measurements of Fo are presented here since what is essential is the difference between the contours of SS and RS words.

Table 16.4 Experiment 2: contingency table of type of stimulus by subject response.

Response           SS stimulus       RS stimulus       Total
(a) Observed responses
SS                     402               17              419
RS                       6              391              397
Total                  408              408              816
(b) Expected responses (and deviances)
SS                   209.5 (176.87)   209.5 (176.87)
RS                   198.5 (186.68)   198.5 (186.68)
Note: Total deviance (χ²) = 727.10, 1 df. The result is significant; p < 0.001.

16.3.2 Results
16.3.2.1 Perceptual experiment
The subjects gave a total of 816 responses excluding the distractors (16 subjects x 48 responses/answer sheet). Nine subjects made no mistakes in the test words and the other seven made a total of 23 mistakes; the test's identification rate was 97.2 percent. Of the subjects that made mistakes, five made between 1 and 3 mistakes (2 mistakes on average). Only two subjects made 6 and 7 mistakes respectively. The distractors' identification rate was very similar (98.2 percent). Only four people made 1, 2, 3, and 9 mistakes each in the distractors. The persons who made the highest number of mistakes in the test words made the highest number of mistakes in the distractors as well.
The results clearly show (see table 16.4) that rhythmic and "secondary stress" can be easily distinguished by native speakers of Greek; thus, it is incorrect to equate them as Malikouti-Drachman and Drachman do.
16.3.2.2 Acoustical analyses
Duration. For each test word the duration of two syllables is presented here: that of the antepenultimate syllable ("secondary" or rhythmic stress) and that of the last syllable (primary stress). Results for all subjects together are shown in figure 16.1.
Figure 16.1 (a) Means (series 1) and SDs (series 2) of the duration, in msec., of antepenultimate syllables of SS words (left, upper case) and RS words (right, lower case) for all subjects. (b) Same measurements for final syllables. [Bar charts not reproduced.]
The data from the three subjects are pooled, as t-tests performed on each subject's data separately showed no differences across subjects. One-tailed t-tests for the data of all three subjects show that, for antepenultimate syllables, the duration of the antepenult of SS words is significantly longer than that of RS words in all word pairs (see table 16.5 for the t-test results). For vowel durations, one-tailed t-tests show that the duration of the
antepenultimate vowel is also significantly longer in SS words, in all test word pairs (see table 16.6 for t-test results). By contrast, no significant differences were found between SS and RS words either in the duration of their final syllables or in that of final vowels when two-tailed tests were performed.

Table 16.5 Results of one-tailed t-tests performed on the durations of the antepenultimate syllables of SS and RS words of all test word pairs. In all cases, df = 16. The syllables that are being compared are in upper case letters.
Test pair        t       p <
1 ePItropi     5.94    0.0005
2 SIMvuli      3.58    0.005
3 siMEtoxi     4.47    0.0005
4 Eniko        5.6     0.0005

Table 16.6 Results of one-tailed t-tests performed on the durations of the vowels of antepenultimate syllables of SS and RS words of all test word pairs. In all cases, df = 16. The vowels that are being compared are in upper case letters.
Test pair        t       p <
1 epItropi     6.83    0.0005
2 sImvuli      5.75    0.0005
3 simEtoxi     8.22    0.0005
4 Eniko        5.6     0.0005
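As an illustration of how the t values in tables 16.5-16.7 can be obtained, the sketch below runs a one-tailed two-sample t-test in Python with scipy. The duration values are invented for the example (they are not the study's data), and the grouping of nine tokens per condition (three subjects x three tokens) is assumed from the reported df = 16.

```python
from scipy import stats

# Hypothetical antepenult durations in msec. (NOT the study's data), nine tokens
# per condition: 3 subjects x 3 tokens, as in experiment 2.
ss_antepenult = [92, 88, 95, 90, 97, 85, 93, 91, 89]   # e.g. /PI/ in an SS word
rs_antepenult = [70, 72, 68, 75, 69, 71, 66, 73, 70]   # e.g. /pi/ in an RS word

# Independent-samples t-test; df = 9 + 9 - 2 = 16, matching tables 16.5-16.6.
t, p_two_tailed = stats.ttest_ind(ss_antepenult, rs_antepenult)

# One-tailed test of the directional hypothesis "SS antepenults are longer":
# halve the two-tailed p when the difference is in the predicted direction.
p_one_tailed = p_two_tailed / 2 if t > 0 else 1 - p_two_tailed / 2
df = len(ss_antepenult) + len(rs_antepenult) - 2
print(f"t({df}) = {t:.2f}, one-tailed p = {p_one_tailed:.5f}")
```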
Amplitude. Results of amplitude measurements were not pooled; the measurements differed extensively between subjects, so that pooling the data could result in statistical artifacts. Of all measurements only AI shows a relatively consistent correlation between stress and amplitude, and only for subject HP. PA and RMS measurements did not yield any significant results for any subject. AI results for HP's data are shown in figure 16.2. One-tailed t-tests showed that all SS antepenults have significantly higher AI than their RS counterparts, although the statistical results are not as strong as those of the durational data (see table 16.7 for details). On the other hand, two-tailed t-tests on the
final syllables of SS and RS words do not show significant differences between them.

Figure 16.2 (a) AI means, expressed as percentages, of antepenultimate syllables of SS words (left, upper case) and RS words (right, lower case) for subject HP. (b) Same measurements for final syllables. [Bar charts not reproduced.]

The results of AI measurements for MK, whose data do not show any correspondence between amplitude and stress, are presented in figure 16.3. Subject to further investigation, the results suggest that perhaps amplitude is not a strong stress correlate in Greek. Botinis (1989), however, found that,
in his data, stressed syllables had significantly higher peak amplitude than unstressed syllables. On the other hand, he also reports that in perceptual tests amplitude changes did not affect the subjects' stress judgments. These results could mean that, as amplitude is not a robust stress cue, its acoustical presence is not necessary and some speakers might opt not to use it.

Figure 16.3 (a) AI means, expressed as percentages, of antepenultimate syllables of SS words (left, upper case) and RS words (right, lower case) for subject MK. (b) Same measurements for final syllables. [Bar charts not reproduced.]

Clearly, both a more detailed investigation into how to measure amplitude, and data
elicited from a larger number of speakers are needed before a conclusion is reached.

Table 16.7 Results of one-tailed t-tests performed on the AI of the antepenultimate syllables of SS and RS words of all test word pairs, for subject HP. In all cases, df = 4. The syllables that are being compared are in upper case letters.
Test pair        t       p <
1 ePItropi     6.62    0.005
2 SIMvuli      3.58    0.025
3 siMEtoxi     6.45    0.005
4 Eniko        3.73    0.025

Fundamental frequency. Characteristic Fo plots are shown in figure 16.4. There were no differences either within each subject's data or across subjects. The Fo contours show a significant difference between SS and RS test words. In SS words, Fo is high on the antepenult whereas in RS words, Fo is very low and relatively flat. No important differences between the contours of last syllables of SS and RS words were found. They all started with slightly low Fo that rose sharply to a high value. One noticeable effect is that in many cases the Fo high is not associated with the beginning of the stressed syllable but rather with its end and the beginning of the following, unstressed syllable. This seems to be a characteristic of Greek stress, as the results of Botinis (1989) confirm.

16.4 Discussion
The results of the first experiment show that native speakers of Greek cannot differentiate between the rightmost lexical stress of a phrase and a SWFC-induced stress which fall on the same syllable of segmentally identical phrases. This implies that, contrary to the analyses of Setatos (1974) and of Nespor and Vogel (1986, 1989), the SWFC-induced stress is the most prominent stress in a host-and-clitic group, in the same way that the most prominent stress in a φ is the rightmost lexical stress. This conclusion agrees with the description of the phenomenon by most analyses of Greek, both phonological (e.g. Joseph and Warburton 1987; Malikouti-Drachman and Drachman 1980) and phonetic (e.g. Botinis 1989), and also with the basic requirement of the SWFC; namely, that the main stress must fall at most three syllables to the left of its domain boundary.
Figure 16.4 Characteristic Fo contours together with the corresponding narrow band spectrograms for HP.'s /ton eniko tis/; (a) SS word (b) RS word. The thicker line on the plot represents the smoothed contour 415
Prosody
indicate that Botinis's proposal that the SWFC-induced stress belongs to a perceptually distinct prosodic category is incorrect. These results are corroborated by those of experiment 2. Starting with Malikouti-Drachman and Drachman's assumption that "secondary stress" (i.e. the weakened lexical stress of the host) and rhythmic stress are phonetically identical, it was shown that they are in fact very different both acoustically and perceptually. On the one hand, the syllables that carry "secondary stress" were shown to be acoustically more prominent than syllables thought to carry rhythmic stress. On the other hand, no acoustical evidence for rhythmic stress was found; syllables thought to carry rhythmic stress exhibited durations and F o contours similar to those of unstressed syllables. Finally, the data corroborate those of experiment 1 in that the final syllables in all test word pairs exhibited no acoustical differences between SS and RS words. These results indicate that the stress of both final syllables is primary whether lexical or SWFC-induced. I propose to account for the present results in the following way. Word (or lexical) stress placement is a lexical process while the SWFC-induced stress in host-and-clitic groups is the result of postlexical application of the SWFC. This difference becomes clear if one considers, again, cliticization and affixation. Although syllable addition is common to both of these processes they yield different results: whereas affixation results in a shift of the main stress as in (10) /'maGima/ "lesson" > /ma'Gimata/ "lessons" cliticization results in a stress addition, as in (11) /'maGima tu/>/,maGi'ma tu/ "his lesson" This is precisely because affixation takes place within the lexical component, whereas cliticization is a postlexical process. Thus, by leaving the lexical component all words, except clitics, form independent stress domains, like the final form in (12). (12)
SD s pi
w ra
SD w ma
"experiment"
> >
s w w pi ra ma
SD +
w ta
"experiment •+• s"
>
w s w w pi ra ma ta
>
"experiments"
The fact that all words constitute independent stress domains is true even of monosyllabic "content" words. The difference between those and clitics becomes apparent when one considers examples like (13): 416
16 Amalia Arvaniti
(13) /'anapse to 'fos/ "turn on the light" which shows that SWFC violations do not arise between words because these form separate stress domains. Clitics, however, remain unattached weak syllables until they are attached to a host postlexically. In this way, clitics extend the boundaries of words, i.e. of stress domains (SDs); clitics form compound SDs with their hosts. For example, (14)
SD w w s w w ton pa te ra mu
w w s w w ton pa te ra mu
w w s w w to(m) b a t e r a mu
"my father (ace.)"
The stress domain formed by the host and its enclitic still has to conform to the SWFC. When cliticization does not result in a SWFC violation no change of stress pattern is necessary. When, however, the SWFC is violated by the addition of enclitics, the results of the violation are different from those observed within the lexical component. This is precisely because the host has already acquired lexical stress and constitutes an independent stress domain with fixed stress. Thus, in SWFC violations the host's stress cannot move from its position, as it does within the lexical component. The only alternative, therefore, is for another stress to be added in such a position that it can comply with the SWFC, thus producing the stress two syllables to the right of the host's lexical stress. In this case the compound SD is divided into two SDs. SD
(15) SD
(15) /to ti'lefono mas/ "our telephone" [stress-domain trees not reproduced: the compound SD is divided into two SDs, with the added stress falling on /no/]
In this way, the subordination of the first stress is captured, as well as the fact that both stresses still belong to one stress domain, albeit a compound one, and therefore they are at the same prosodic level. The disadvantage of 417
this proposal is that there is no motivation for choosing between /fonomas/ and /nomas/ as the second constituent of the compound SD.
Finally, another question that emerges from the experimental results is whether Greek metrical structure needs to present rhythmic stresses at all. Experimental data indicate that there is no acoustical evidence for rhythmic stress in Greek and that what has often been described as "secondary stress" is a weakened lexical stress. This statement may, at first, seem self-evident; however, it must be remembered that a weakened lexical stress and the subordinate stresses of words have not always been considered the same. Liberman and Prince (1977), for instance, refer to Trager and Smith (1951), who "argued for a distinction between nonprimary stresses within a word, and subordinated main stresses of independent words, a distinction that could be expressed by a one-level downgrading of all nonprimary stresses within the confines of a given word; thus [3 1] Tennessee but [2 1] Aral Sea" (Liberman and Prince 1977: 255). If this line of argument is followed in the Greek examples, then the lexical stress of the host should become a "3 stress" rather than a "2 stress" since it becomes a subordinate stress within a word. This, however, does not happen; on the contrary, the lexical stress of the host remains strong enough to be perceptually identical to a subordinate main stress, as experiment 1 showed. In my opinion, these results cast doubt on the presence of "secondary stress," and consequently rhythmic stress, in Greek.
Although further investigation is necessary before a solution is reached, there are strong arguments, in addition to the acoustical evidence, for proposing that Greek does not in fact exhibit rhythmic stress. Since experimental results fail to show any evidence for rhythmic stresses, the only argument for their existence appears to be that they are heard. However, in certain cases, different investigators have conflicting opinions as to the placement of rhythmic stresses. For instance, on a phrase like
(16) /me'yali katastro'fi/ "great destruction"
the rhythmic stress should be on /ka/, according to the rules of Nespor and Vogel (1989), while, according to Malikouti-Drachman and Drachman (1980), the rhythmic stress should fall on /ta/. Differences between the two analyses become greater as the number of unstressed syllables between lexical stresses increases. Moreover, only phonological analyses suggest that Greek exhibits rhythmic stress. Phonetic analyses either do not mention the matter (Dauer 1980; Botinis 1989), or fail to find evidence for rhythmic stress; for instance, Fourakis (1986) concludes, on the basis of durational data, that Greek seems
to have a two-way distinction: ± stress, with no gradations. Although more detailed research into the presence of rhythmic stress is necessary, there is fairly strong support for postulating an n-ary branching analysis in which no rhythmic stresses are marked. If further evidence confirms the present results, then the universality of binary rhythmic patterns could be questioned.
16.5 Conclusion
It has been shown that the Greek Stress Well-formedness Condition applies both lexically, moving lexical stress to the right of its original position, and postlexically, adding a stress two syllables to the right of the host's lexical stress in a host-and-clitic group. In the latter case, the SWFC-induced stress becomes the most prominent in the group; this stress was shown to be perceptually identical to a lexical stress. The weakened lexical stress of the host was shown to be acoustically and perceptually similar to a subordinate lexical stress and not to rhythmic stress, as has often been thought. The experimental evidence, together with the absence of strong phonological arguments to the contrary, suggests that Greek might not exhibit rhythmic stresses at all.
Appendix 1 The test phrases (bold type) of experiment 1 in the context in which they were read
1(a) /tu 'ipa yi'a to 'ari'sta su ke 'xarike po'li/ "I told him about your first-class mark and he was very pleased."
 (b) /e'yo tu fo'nazo 'ari stasu ki a'ftos ðe stama'tai/ "I shout at him Ari stop, but he doesn't stop."
2(a) /pso'nizi 'panda a'po to psaradi'ko tus/ "S/he always shops from their fishmongery."
 (b) /'ixan a'nekaθen psa'ra di'ko tus/ "They have always had their own fishmonger."
Appendix 2 The distractors (bold type) of experiment 1 in the context in which they were read
1(a) /pi'stevo 'oti 'ksero to 'mono 'loyo yi'a a'fti tin ka'tastasi/ "I believe I know the only reason for this situation."
 (b) /den 'exo a'kusi pi'o vare'to mo'noloyo sto 'θeatro/ "I haven't listened to a more boring theatrical monologue."
2(a) /ðe 'θelo 'pare 'dose me a'fto to 'atomo/ "I don't want to have anything to do with this person."
 (b) /'ksero 'oti to pa'redose stus dike'uxus/ "I know that he delivered it to the beneficiaries."
Appendix 3 The test sentences of experiment 2. The test words are in bold type
1(a) /i .epitro'pi mas 'itan a'kurasti/ "Our commissioners were indefatigable."
 (b) /i epitro'pi mas 'itan a'kurasti/ "Our committee was indefatigable."
2(a) /pi'stevo 'oti i .simvu'li tu 'itan so'fi/ "I believe that his counsellors were wise."
 (b) /pi'stevo 'oti i simvu'li tu 'itan so'fi/ "I believe that his advice was wise."
3(a) /no'mizo 'oti i .simeto'xi tu 'ine e'ksisu apa'retiti/ "I think that his co-participants are equally necessary."
 (b) /no'mizo 'oti i simeto'xi tu 'ine e'ksisu apa'retiti/ "I think that his participation is equally necessary."
4(a) /mu 'eleye 'oti 'vriski ton .eni'ko tis po'li enoxliti'ko/ "S/he was telling me that s/he finds her tenant very annoying."
 (b) /mu 'eleye 'oti 'vriski ton eni'ko tis po'li enoxliti'ko/ "S/he was telling me that s/he finds her 'singular' [nonuse of politeness forms] very annoying."
Appendix 4 The distractor sentences of experiment 2. The distractors are in bold type
1(a) /ma a'fto 'ine porto'kali/ "But this is an orange."
 (b) /ma a'fto 'ine portoka'li/ "But this is orange."
2(a) /i a'poxi tus 'itan po'li me'yali/ "Their hunting-net was very big."
 (b) /i apo'xi tus 'itan po'li me'yali/ "Their abstention was very big."
3(a) /teli'ka to 'kerdise to me'talio/ "Finally he won the medal."
 (b) /teli'ka to 'kerdise to meta'lio/ "Finally he won the mine."
4(a) /stin omi'lia tu ana'ferθike stus 'nomus tus/ "In his speech he referred to their laws."
 (b) /stin omi'lia tu ana'ferθike stus no'mus tus/ "In his speech he referred to their counties."
References
Abbreviations ARIPUC Annual Report, Institute of Phonetics, University of Copenhagen CSLI Center for the Study of Language and Information IPO Instituut voor Perceptie Onderzoek (Institute for Perception Research) IULC Indiana University Linguistics Club JASA Journal of the Acoustical Society of America JL Journal of Linguistics JPhon Journal of Phonetics JSHR Journal of Speech and Hearing Research JVLVB Journal of Verbal Language and Verbal Behaviour LAGB Linguistics Association of Great Britain Lg Language Lg & Sp Language and Speech LI Linguistic Inquiry MITPR Massachusetts Institute of Technology, Progress Report NELS North East Linguistic Society PERILUS Phonetic Experimental Research at the Institute of Linguistics, University of Stockholm Proc. IEEE Int. Conf Ac, Sp. & Sig. Proc. Proceedings of the Institute of Electrical and Electronics Engineers Conference on Acoustics, Speech and Signal Processing PY Phonology Yearbook RILP Report of the Institute of Logopedics and Phoniatrics STL-QPSR Quarterly Progress and Status Report, Speech Transmission Laboratory, Royal Institute of Technology (Stockholm) WCCFL West Coast Conference on Formal Linguistics Albrow, K. H. 1975. Prosodic theory, Hungarian and English. Festschrift fur Norman Denison zum 50. Geburtstag (Grazer Linguistiche Studien, 2). Graz: University of Graz Department of General and Applied Linguistics. Alfonso, P. and T. Baer. 1982. Dynamics of vowel articulation. Lg & Sp 25: 151-73. 424
References Ali, L. H., T. Gallagher, J. Goldstein, and R. G. Daniloff. 1971. Perception of coarticulated nasality. JASA 49: 538^0. Allen, J., M. S. Hunnicutt, and D. Klatt. 1987. From Text to Speech: the MITalk System. Cambridge: Cambridge University Press. Anderson, L. B. 1980. Using asymmetrical and gradient data in the study of vowel harmony. In R. M. Vago (ed.), Issues in Vowel Harmony. Amsterdam: John Benjamins. Anderson, M., J. Pierrehumbert, and M. Liberman. 1984. Synthesis by rule of English intonation patterns. Proc. IEEE Int. Conf. Ac, Sp. & Sig. Proc. 282-4. New York: IEEE. Anderson, S. R. 1974. The Organization of Phonology. New York: Academic Press. 1978. Tone features. In V. Fromkin (ed.), Tone: a Linguistic Survey. New York: Academic Press. 1982. The analysis of French schwa. Lg 58: 535-73. 1986. Differences in rule type and their structural basis. In H. van der Hulst and N. Smith (eds.), The Structure of Phonological Representations, part 2. Dordrecht: Foris. Anderson, S. R. and W. Cooper. Fundamental frequency patterns during spontaneous picture description. Ms. University of Iowa. Archangeli, D. 1984. Underspecification in Yawelmani phonology. Doctoral dissertation, Cambridge, MIT. 1985. Yokuts harmony: evidence for coplanar representation in nonlinear phonology. LI 16: 335-72. 1988. Aspects of underspecification theory. Phonology 5.2: 183-207. Arvaniti, A. 1990. Review of A. Botinis, 1989. Stress and Prosodic Structure in Greek: a Phonological, Acoustic, Physiological and Perceptual Study. Lund University Press. JPhon 18: 65-9. Bach, E. 1968. Two proposals concerning the simplicity metric in phonology. Glossa 4: 3-21. Barry, M. 1984. Connected speech: processes, motivations and models. Cambridge Papers in Phonetics and Experimental Linguistics 3. 1985. A palatographic study of connected speech processes. Cambridge Papers in Phonetics and Experimental Linguistics 4. 1988. Assimilation in English and Russian. Paper presented at the colloquium of the British Association of Academic Phoneticians, Trinity College, Dublin, March 1988. Beattie, G., A. Cutler, and M. Pearson. 1982. Why is Mrs Thatcher interrupted so often? Nature 300: 744-7. Beckman, M. E. 1986. Stress and Non-Stress Accent (Netherlands Phonetic Archives 7). Dordrecht: Foris. Beckman, M. E. and J. Kingston. 1990. Introduction to J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech. Cambridge: Cambridge University Press, 1-16. Beckman, M. E. and J. B. Pierrehumbert. 1986. Intonational structure in English and Japanese. PY 3: 255-310. 425
References Beddor, P. S., R. A. Krakow, and L. M. Goldstein. 1986. Perceptual constraints and phonological change: a study of nasal vowel height. PY 3: 197-217. Bell-Berti, F. and K. S. Harris. 1981. A temporal model of speech production. Phonetica 38: 9-20. Benguerel, A. P. and T. K. Bhatia. 1980. Hindi stop consonants: an acoustic and fiberscopic study. Phonetica 37: 134—48. Benguerel, A. P. and H. Cowan. 1974. Coarticulation of upper lip protrusion in French. Phonetica 30: 41-55. Berendsen, E. 1986. The Phonology of Cliticization. Dordrecht: Foris. Bernstein, N. A. 1967. The Coordination and Regulation of Movements. London: Pergamon. Bertch, W. F., J. C. Webster, R. G. Klumpp, and P. O. Thomson. 1956. Effects of two message-storage schemes upon communications within a small problem-solving group. JASA 28: 550-3. Bickley, C. 1982. Acoustic analysis and perception of breathy vowels. Working Papers, MIT Speech Communications 1: 74—83. Bing, J. M. 1979. Aspects of English prosody. Doctoral dissertation, University of Massachusetts. Bird, S. and E. Klein. 1990. Phonological events. JL 26: 33-56. Bloch, B. 1941. Phonemic overlapping. American Speech 16: 278-84. Bloomfield, L. 1933. Language. New York: Holt. Blumstein, S. E. and K. N. Stevens. 1979. Acoustic invariance in speech production: evidence from the spectral characteristics of stop consonants. JASA 66: 1011-17. Bolinger, D. 1951. Intonation: levels versus configuration. Word!: 199-210. 1958. A theory of pitch accent in English. Word 14: 109-49. 1986. Intonation and its Parts. Stanford, CA: Stanford University Press. Botha, R. P. 1971. Methodological Aspects of Transformational Generative Phonology. The Hague: Mouton. Botinis, A. 1989. Stress and Prosodic Structure in Greek: A Phonological, Acoustic, Physiological and Perceptual Study. Lund: Lund University Press. Boyce, S. 1986. The "trough" phenomenon in Turkish and English. JASA 80: S.95 (abstract). Broe, M. 1988. A unification-based approach to prosodic analysis. Edinburgh Working Papers in Linguistics 21: 63-82. Bromberger, S. and M. Halle. 1989. Why phonology is different. L/20.1: 51-69. Browman, C. P. 1978. Tip of the tongue and slip of the ear: implications for language processing. UCLA Working Papers in Phonetics, 42. Browman, C. P. and L. Goldstein. 1985. Dynamic modeling of phonetic structure. In V. Fromkin (ed.), Phonetic Linguistics. New York: Academic Press. 1986. Towards an articulatory phonology. PY 3: 219-52. 1988. Some notes on syllable structure in articulatory phonology. Phonetica 45: 140-55. 1989. Articulatory gestures as phonological units. Phonology, 6.2: 201-51. 1990. Tiers in articulatory phonology with some implications for casual speech. In J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I: Between 426
References the Grammar and the Physics of Speech. Cambridge: Cambridge University Press, 341-76. Browman, C. P., L. Goldstein, E. L. Saltzman, and C. Smith. 1986. GEST: a computational model for speech production using dynamically defined articulatory gestures. JASA, 80, Suppl. 1 S97 (abstract). Browman, C. P., L. Goldstein, J. A. S. Kelso, P. Rubin, and E. L. Saltzman. 1984. Articulatory synthesis from underlying dynamics. JASA 75: S22-3. (abstract). Brown, G., K. Currie, and J. Kenworthy. 1980. Questions of Intonation. London: Croom Helm. Brown, R. W. and D. McNeill. 1966. The "tip of the tongue" phenomenon. JVLVB 5: 325-37. Bruce, G. 1977. Swedish word accents in sentence perspective. Lund: Gleerup. 1982a. Developing the Swedish intonation model. Working Papers, Department of Linguistics, University of Lund, 22: 51-116. 1982b. Textual aspects of prosody in Swedish. Phonetica 39: 274-87. Bruce, G. and E. Garding. 1978. A prosodic typology for Swedish dialects. In E. Garding, G. Bruce, and R. Bannert (eds.), Nordic Prosody. Lund: Gleerup. Bullock, D. and S. Grossberg. 1988. The VITE model: a neural command circuit for generating arm and articulator trajectories. In J. A. S. Kelso, A. J. Mandell, and M. F. Shlesinger (eds.), Dynamic Patterns in Complex Systems. Singapore: World Scientific, 305-26. Carlson, L. 1983. Dialogue Games: an Approach to Discourse Analysis (Synthese Language Library 17). Dordrecht: Reidel. Carnochan, J. 1957. Gemination in Hausa. In Studies in Linguistic Analysis. The Philological Society, Oxford: Basil Blackwell. Catford, J. C. 1977. Fundamental Problems in Phonetics. Edinburgh: Edinburgh University Press. Chang, N-C. 1958. Tones and intonation in the Chengtu dialect (Szechuan, China). Phonetica 2: 59-84. Chao, Y. R. 1932. A preliminary study of English intonation (with American variants) and its Chinese equivalents. T'sai Yuan Pei Anniversary Volume, Suppl. Vol. 1 Bulletin of the Institute of History and Philology of the Academica Sinica. Peiping. Chatterjee, S. K. 1975. Origin and Development of the Bengali Language. Calcutta: Rupa. Chiba, T. and M. Kajiyama. 1941. The Vowel. Its Nature and Structure. Tokyo: Taseikan. Choi, J. 1989. Some theoretical issues in the analysis of consonant to vowel spreading in Kabardian. MA thesis, Department of Linguistics, UCLA. Chomsky, N. 1964. The nature of structural descriptions. In N. Chomsky, Current Issues in Linguistic Theory. The Hague: Mouton. 1965. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press. Chomsky, N. and M. Halle. 1965. Some controversial questions in phonological theory. JL 1:97-138. 1968. The Sound Pattern of English. New York: Harper and Row. 427
References Clark, M. 1990. The Tonal System of Igbo. Dordrecht: Foris. Clements, G. N. 1976. The autosegmental treatment of vowel harmony. In W. Dressier and O. Pfeiffer (eds.), Phonologica 1976. Innsbruck: Innsbrucker Beitrage zur Sprachwissenschaft. 1978. Tone and syntax in Ewe. In D. J. Napoli (ed.), Elements of Tone, Stress, and Intonation. Georgetown: Georgetown University Press. 1979. The description of terrace-level tone languages. Lg 55: 536-58. 1981. The hierarchical representation of tone features. Harvard Studies in Phonology 2: 50-105. 1984. Principles of tone assignment in Kikuyu. In G. N. Clements and J. Goldsmith (eds.), Autosegmental Studies in Bantu Tone. Dordrecht: Foris, 281-339. 1985. The geometry of phonological features. PY 2: 225-52. 1986. Compensatory lengthening and consonant gemination in Luganda. In L. Wetzels and E. Sezer (eds.), Studies in Compensatory Lengthening. Dordrecht: Foris. 1987. Phonological feature representation and the description of intrusive stops. Papers from the Parasession on Autosegmental and Metrical Phonology. Chicago Linguistics Society, University of Chicago. 1990a. The role of the sonority cycle in core syllabification. In J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech. Cambridge: Cambridge University Press, 283-333. 1990b. The status of register in intonation theory. In J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech. Cambridge: Cambridge University Press, 58-71. Clements, G. N. and J. Goldsmith. 1984. Introduction. In G. N. Clements and J. Goldsmith (eds.), Autosegmental Studies in Bantu Tone. Dordrecht: Foris. Clements, G. N. and S. J. Keyser. 1983. CV Phonology: a Generative Theory of the Syllable. Cambridge, MA: MIT Press. Cohen, A. and J. 't Hart. 1967. On the anatomy of intonation. Lingua 19: 177-92. Cohen, A., R. Collier, and J. 't Hart. 1982. Declination: construct or intrinsic feature of speech pitch? Phonetica 39: 254-73. Cohen, J. and P. Cohen. 1983. Applied Multiple Regression/ Correlation Analysis for the Behavioral Sciences, 2nd edn. Hillsdale, NJ: Lawrence Erlbaum. Coleman, J. 1987. Knowledge-based generation of speech synthesis parameters. Ms. Experimental Phonetics Laboratory, Department of Language and Linguistic Science, University of York. 1989. The phonetic interpretation of headed phonological structures containing overlapping constituents. Manuscript. Coleman, J. and J. Local. 1991. "Constraints" in autosegmental phonology. To appear in Linguistics and Philosophy. Collier, R. 1989. On the phonology of Dutch intonation. In F. J. Heyvaert and F. Steurs (eds.), Worlds behind Words. Leuven: Leuven University Press. Collier, R. and J. 't Hart. 1981. Cursus Nederlandse intonatie. Leuven: Acco. Connell, B. and D. R. Ladd 1990. Aspects of pitch realisation in Yoruba. Phonology 1: 1-29. 428
References Cooper, A. forthcoming. Stress effects on laryngeal gestures. Cooper, W. E. and J. M. Paccia-Cooper. 1980. Syntax and Speech. Cambridge, MA: Harvard University Press. Cooper, W. E. and J. Sorensen, 1981. Fundamental Frequency in Sentence Production. New York: Springer. Costa, P. J. and I. G. Mattingly. 1981. Production and perception of phonetic contrast during phonetic change. Status Report on Speech Research Sr-67/68. New Haven: Haskins Laboratories, 191-6. Cotton, S. and F. Grosjean. 1984. The gating paradigm: a comparison of successive and individual presentation formats. Perception and Psychophysics 35: 41-8. Crompton, A. 1982. Syllables and segments in speech production. In A. Cutler (ed.), Slips of the Tongue and Language Production. Amsterdam: Mouton. Crystal, D. 1969. Prosodic Systems and Intonation in English. Cambridge: Cambridge University Press. Cutler, A. 1980. Errors of stress and intonation. In V. A. Fromkin (ed.), Errors in Linguistic Performance. New York: Academic Press. 1987. Phonological structure in speech recognition. PY 3: 161-78. Cutler, A., J. Mehler, D. Norris, and J. Segui. 1986. The syllable's differing role in the segmentation of French and English. Journal of Memory and Language 25: 385-400. Dalby, J. M. 1986. Phonetic Structure of Fast Speech in American English. Bloomington: IULC. Daniloff, R. and R. E. Hammarberg. 1973. On defining coarticulation. JPhon, 1: 239-48. Daniloff, R., G. Shuckers, and L. Feth. 1980. The Physiology of Speech and Hearing. Englewood Cliffs, NJ: Prentice Hall. Dauer, R. M. 1980. Stress and rhythm in modern Greek. Doctoral dissertation, University of Edinburgh. Delattre, P. 1966. Les Dix Intonations de base du Francais. French Review 40: 1-14. 1971. Pharyngeal features in the consonants of Arabic, German, Spanish, French, and American English. Phonetica 23: 129-55. Dell, F. forthcoming. L'Accentuation dans les phrases en Francais. In F. Dell, J.-R. Vergnaud, and D. Hirst (eds.), Les Representations en phonologic Paris: Hermann. Dev, A. T. 1973. Students' Favourite Dictionary. Calcutta: Dev Sahitya Kutir. Diehl, R. and K. Kluender. 1989. On the objects of speech perception. Ecological Psychology 1.2: 121-44. Dinnsen, D. A. 1983. On the Characterization of Phonological Neutralization. IULC. 1985. A re-examination of phonological neutralisation. JL 21: 265-79. Dixit, R. P. 1987. In defense of the phonetic adequacy of the traditional term "voiced aspirated." UCLA Working Papers in Phonetics 67: 103-11. Dobson, E. J. 1968. English Pronunciation. 1500-1700. 2nd edn. Oxford: Oxford University Press. Docherty, G. J. 1989. An experimental phonetic study of the timing of voicing in English obstruents. Doctoral dissertation, University of Edinburgh. 429
References Downing, B. 1970. Syntactic structure and phonological phrasing in English. Doctoral dissertation, University of Texas. Dowty, D. R., R. E. Wall, and S. Peters. 1981. Introduction to Montague Semantics. Dordrecht: Reidel. Erikson, D. 1976. A physiological analysis of the tones of Thai. Doctoral dissertation, University of Connecticut. Erikson, Y. 1973. Preliminary evidence of syllable locked temporal control of Fo. STL-QPSR 2-3: 23-30. Erikson, Y. and M. Alstermark. 1972. Fundamental frequency correlates of the grave accent in Swedish: the effect of vowel duration. STL-QPSR 2-3: 53-60. Fant, G. 1959. Acoustic analysis and synthesis of speech with applications to Swedish. Ericsson Technics 1. 1960. Acoustic Theory of Speech Production. The Hague: Mouton. Fant, G. and Q. Linn. 1988. Frequency domain interpretation and derivation of glottal flow parameters. STL-QPSR 2-3: 1-21. Ferguson, C. A. and M. Chowdhury. 1960. The phonemes of Bengali. Lg 36.1: 22-59. Firth, J. R. 1948. Sounds and prosodies. Transactions of the Philological Society 127-52; also in F. R. Palmer (ed.) Prosodic Analysis. Oxford: Oxford University Press. 1957. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis. The Philological Society, Oxford: Basil Black well. Fischer-Jorgensen, E. 1975. Trends in Phonological Theory. Copenhagen: Akademisk. Flanagan, J. L. 1972. Speech Analysis, Synthesis, and Perception, 2nd edn. Berlin: Springer. Folkins, J. W. and J. H. Abbs. 1975. Lip and jaw motor control during speech: responses to resistive loading of the jaw. JSHR 18: 207-20. Foss, D. J. 1969. Decision processes during sentence comprehension: effects of lexical item difficulty and position upon decision times. JVLVB 8: 457-62. Foss, D. J. and M. A. Blank. 1980. Identifying the speech codes. Cognitive Psychology 12: 1-31. Foss, D. J. and M. A. Gernsbacher. 1983. Cracking the dual code: toward a unitary model of phoneme identification. JVLVB 22: 609-32. Foss, D. J. and D. A. Swinney. 1973. On the psychological reality of the phoneme: perception, identification, and consciousness. JVLVB 12: 246-57. Foss, D. J., D. A. Harwood, and M. A. Blank. 1980. Deciphering decoding decisions: data and devices. In R. A. Cole (ed.), The Perception and Production of Fluent Speech, Hillsdale, NJ: Lawrence Erlbaum. Fourakis M. 1986. An acoustic study of the effects of tempo and stress on segmental intervals in modern Greek. Phonetica 43: 172-88. Fourakis, M. and R. Port. 1986. Stop epenthesis in English. JPhon 14: 197-221. Fowler, C. A. 1977. Timing Control in Speech Production. Bloomington, IULC. 1980. Coarticulation and theories of extrinsic timing control. JPhon 8: 113-33. 1981a. Perception and production of coarticulation among stressed and unstressed vowels. JSHR 24: 127-39. 430
References 1981b. A relationship between coarticulation and compensatory shortening. Phonetica 38: 35-50. 1985. Current perspectives on language and speech perception: a critical overview. In R. Daniloff (ed.), Speech Science: Recent Advances. San Diego, CA: College-Hill. 1986. An event approach to the study of speech perception from a direct-realist perspective. JPhon 14: 3-28. Fowler, C. A., P. Rubin, R. E. Remez, and M. T. Turvey. 1980. Implications for speech production of a skilled theory of action. In B. Butterworth (ed.), Language Production I. London: Academic Press. Frederiksen, J. R. 1967. Cognitive factors in the recognition of ambiguous auditory and visual stimuli. (Monograph) Journal of Personality and Social Psychology 7. Franke, F. 1889. Die Umgangssprache der Nieder-Lausitz in ihren Lauten. Phonetische Studien II, 21. Fritzell, B. 1969. The velopharyngeal muscles in speech. Acta Otolaryngologica. Suppl. 250. Fromkin, V. A. 1971. The non-anomalous nature of anomalous utterances. Lg 47: 27-52. 1976. Putting the emPHAsis on the wrong sylLABle. In L. M. Hyman (ed.), Studies in Stress and Accent. Los Angeles: University of Southern California. Fry, D. B. 1955. Duration and intensity as physical correlates of linguistic stress. JASA 27: 765-8. 1958. Experiments in the perception of stress. Lg & Sp 1: 126-52. Fudge, E. C. 1987. Branching structure within the syllable. JL 23: 359-77. Fujimura, O. 1962. Analysis of nasal consonants. JASA 34: 1865-75. 1986. Relative invariance of articulatory movements: an iceberg model. In J. S. Perkell and D. H. Klatt (eds.), Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence Erlbaum. 1987. Fundamentals and applications in speech production research. Proceedings of the Eleventh International Congress of Phonetic Sciences. 6: 10-27. 1989a. An overview of phonetic and phonological research. Nihongo to Nihongo Kyooiku 2: 365-89. (Tokyo: Meiji-shoin.) 1989b. Comments on "On the quantal nature of speech", by K. N. Stevens. JPhon 17: 87-90. 1990. Toward a model of articulatory control: comments on Browman and Goldstein's paper. In J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech. Cambridge: Cambridge University Press, 377-81. Fujimura, O. and M. Sawashima. 1971. Consonant sequences and laryngeal control. Annual Bulletin of the Research Institute of Logopedics and Phoniatrics 5: 1-6. Fujisaki, H. and K. Hirose. 1984. Analysis of voice fundamental frequency contours for declarative sentences of Japanese. Journal of the Acoustical Society of Japan 5.4: 233-42. 431
References Fujisaki, H. and H. Keikichi. 1982. Modelling the dynamic characteristics of voice fundamental frequency with applications to analysis and synthesis of intonation. Preprints of Papers, Working Group on Intonation, Thirteenth International Congress of Linguists, Tokyo. Fujisaki, H. and S. Nagashima. 1969. A model for the synthesis of pitch contours of connected speech. Tokyo University Engineering Research Institute Annual Report 28: 53-60. Fujisaki, H. and H. Sudo. 1971a. A generative model for the prosody of connected speech in Japanese. Tokyo University Engineering Research Institute Annual Report 30: 75-80. 1971b. Synthesis by rule of prosodic features of Japanese. Proceedings of the Seventh International Congress of Acoustics 3: 133-6. Fujisaki, H., M. Sugito, K. Hirose, and N. Takahashi. 1983. Word accent and sentence intonation in foreign language learning. Preprints of Papers, Working Group on Intonation, Thirteenth International Congress of Linguists, Tokyo: 109-19. Gage, W. 1958. Grammatical structures in American English intonation. Doctoral dissertation, Cornell University. Gamkrelidze, T. V. 1975. On the correlation of stops and fricatives in a phonological system. Lingua 35: 231-61. Garding, E. 1983. A generative model of intonation. In A. Cutler and D. R. Ladd (eds.), Prosody: Models and Measurements. Heidelberg: Springer. Garding, E., A. Botinis, and P. Touati. 1982. A comparative study of Swedish, Greek and French intonation. Working Papers, Department of Linguistics, University of Lund, 22: 137-52. Gay, T. 1977. Articulatory movements in VCV sequences. JASA 62: 183-93. 1978. Articulatory units: segments or syllables. In A. Bell and J. Hooper (eds.), Segments and Syllables. Amsterdam: North Holland. 1981. Mechanisms in the control of speech rate. Phonetica 38: 148-58. Gazdar, G., E. Klein, G. Pullum, and I. Sag. 1985. Generalised Phrase Structure Grammar. London: Basil Blackwell. Gimson, A. C. 1960. The instability of English alveolar articulations. Le Maitre Phonetique 113: 7-10. 1970. An Introduction to the Pronunciation of English. London: Edward Arnold. Gobi, C. 1988. Voice source dynamics in connected speech. STL-QPSR 1: 123-59. Goldsmith, J. 1976. Autosegmental Phonology. MIT Doctoral dissertation. New York: Garland, 1979. 1984. Tone and accent in Tonga. In G. N. Clements and J. Goldsmith (eds.), Autosegmental Studies in Bantu Tone (Publications in African Languages and Linguistics 3). Dordrecht: Foris, 19-51. Gracco, V. and J. Abbs. Variant and invariant characteristics of speech movements. Experimental Brain Research 65: 165-6. Greene, P.H. 1971. Introduction. In I. M. Gelfand, V. S. Gurfinkel, S. V. Fomin, and M. L. Tsetlin (eds.), Models of Structural Functional Organization of Certain Biological Systems. Cambridge, MA: MIT Press, xi-xxxi. 432
References Gronnum, N. forthcoming. Prosodic parameters in a variety of regional Danish standard languages, with a view towards Swedish and German. To appear in Phone tica. Grosjean, F. 1980. Spoken word recognition processes and the gating paradigm. Perception and Psychophysics 28: 267-83. Giinther, H. 1988. Oblique word forms in visual word recognition. Linguistics 26: 583-600. Gussenhoven, C. 1983. Focus, mode and the nucleus. JL 19: 377-417. 1984. On the Grammar and Semantics of Sentence Accents. Dordrecht: Foris. 1988. Adequacy in intonational analysis: the case of Dutch. In H. van der Hulst and N. Smith (eds.), Autosegmental Studies in Pitch Accent. Dordrecht: Foris. forthcoming. Intonational phrasing and the prosodic hierarchy. Phonologica 1988. Cambridge: Cambridge University Press. Gussenhoven, C. and T. Rietveld. 1988. Fundamental frequency declination in Dutch: testing three hypotheses. JPhon 16: 355-69. Hakoda, K. and H. Sato. 1980. Prosodic rules in connected speech synthesis. Trans. IECE. 63-D No. 9: 715-22. Halle, M. and K. N. Stevens. 1971. A note on laryngeal features. MITPR 101: 198-213. Halle, M. and J. Vergnaud. 1980. Three dimensional phonology. Journal of Linguistic Research 1: 83-105. Halliday, M. A. K. 1967. Intonation and Grammar in British English. The Hague: Mouton. Hammond, M. 1988. On deriving the well-formedness condition. LI 19: 319-25. Han, M. S. 1962. Japanese Phonology: An Analysis Based on Sound Spectrograms. Tokyo: Kenkyusha. Haraguchi, S. 1977. The Tone Pattern of Japanese: An Autosegmental Theory of Tonology. Tokyo: Kaitakushi. Hardcastle, W. J. 1972. The use of electropalatography in phonetic research. Phonetica 25: 197-215. 1976. Physiology of Speech Production: An Introduction for speech scientists. London: Academic Press. Harris, Z. H. 1944. Simultaneous components in phonology. Lg 20: 181-205. Harshman, R., P. Ladefoged and L. Goldstein. 1977. Factor analysis of tongue shapes. JASA 62: 693-707. 't Hart, J. 1979a. Naar automatisch genereeren van toonhoogte-contouren voor tamelijk lange stukken spraak. IPO Technical Report No. 353, Eindhoven. 1979b. Explorations in automatic stylization of Fo curves. IPO Annual Progress Report 14: 61-5. 1981. Differential sensitivity to pitch distance, particularly in speech. JASA 69: 811-21. 't Hart, J. and A. Cohen. 1973. Intonation by rule: a perceptual quest. JPhon 1: 309-27. 't Hart, J. and R. Collier. 1975. Integrating different levels of intonation analysis. JPhon 3: 235-55. 433
References 1979. On the interaction of accentuation and intonation in Dutch. Proceedings of The Ninth International Congress of Phonetic Sciences 2: 385-402. Hawking, S. W. 1988. A Brief History of Time. London: Bantam Press. Hawkins, S. 1984. On the development of motor control in speech: evidence from studies of temporal coordination. In N. J. Lass (ed.), Speech and Language: Advances in Basic Research and Practice 11. ZY1-1A. Hayes, B. 1981. A Metrical theory of stress rules. Bloomington: IULC. 1986. Inalterability in CV Phonology. Lg 62.2: 321-51. 1989. Compensatory lengthening in moraic phonology. LI 20.2: 253-306. Hayes, B. and A. Lahiri. 1991. Bengali intonational phonology. Natural Language and Linguistic Theory 9.1: 47-96. Hayes, B. and S. Puppel. 1985. On the rhythm rule in Polish. In H. van der Hulst and N. Smith (eds.), Advances in Non-Linear Phonology. Dordrecht: Foris. Hebb, D. O. 1949. The Organization of Behavior. New York: Wiley. Helfrich, H. 1979. Age markers in speech. In K. R. Scherer and H. Giles (eds.), Social Markers in Speech. Cambridge: Cambridge University Press. Henderson, J. B. 1984. Velopharyngeal function in oral and nasal vowels: a crosslanguage study. Doctoral dissertation, University of Connecticut. Henke, W. L. 1966. Dynamic articulatory model of speech production using computer simulation. Doctoral dissertation, MIT. Hewlett, N. 1988. Acoustic properties of /k/ and /t/ in normal and phonologically disordered speech. Clinical Linguistics and Phonetics 2: 29—45. Hirose, K., H. Fujisaki, and H. Kawai. 1985. A system for synthesis of connected speech - special emphasis on the synthesis of prosodic features. Onsei Kenkyuukai S85^13: 325-32. The Acoustical Society of Japan. Hirose, K., H. Fujisaki, M. Yamaguchi, and M. Yokoo. 1984. Synthesis of fundamental frequency contours of Japanese sentences based on syntactic structure (in Japanese). Onsei Kenkyuukai S83-70: 547-54. The Acoustical Society of Japan. Hirschberg, J. and J. Pierrehumbert. 1986. The intonational structuring of discourse. Proceedings of the 24th Annual Meeting, Association for Computational Linguistics, 136-44. Hjelmslev, L. 1953. Prolegomena to a Theory of Language, Memoir 7, translated by F. J. Whitfield. Baltimore: Waverly Press. Hockett, C. F. 1958. A Course in Modern Linguistics. New York: Macmillan. Hombert, J-M. 1986. Word games: some implications for analysis of tone and other phonological processes. In J. J. Ohala and J. J. Jaeger (eds.), Experimental Phonology. Orlando, FL: Academic Press. Honda, K. and O. Fujimura. 1989. Intrinsic vowel F o and phrase-final lowering: Phonological vs. biological explanations. Paper presented at the 6th Vocal Fold Physiology Conference, Stockholm, August 1989. Hooper, J. B. 1976. An Introduction to Natural Generative Phonology. New York: Academic Press. Houlihan, K. and G. K. Iverson. 1979. Functionally constrained phonology. In D. Dinnsen (ed.), Current Approaches to Phonological Theory. Bloomington: Indiana University Press. 434
References Householder, F. 1957. Accent, juncture, intonation, and my grandfather's reader. Word 13: 234^45. 1965. On some recent claims in phonological theory. JL 1: 13-34. Huang, C-T. J. 1980. The metrical structure of terraced level tones. In J. Jensen (ed.), NELS 11. Department of Linguistics, University of Ottawa. Huggins, A. W. F. 1964. Distortion of the temporal pattern of speech: interruption and alternation. JASA 36: 1055-64. Huss, V. 1978. English word stress in the post-nuclear position. Phonetica 35: 86-105. Hyman, L. M. 1975. Phonology: Theory and Analysis. New York: Holt, Rinehart, and Winston. 1985. A Theory of Phonological Weight. Dordrecht: Foris. Jackendoff, R. 1972. Semantic Interpretation in Generative Grammar. Cambridge, MA: MIT Press. Jakobson, R., G. Fant, and M. Halle. 1952. Preliminaries to Speech Analysis: the Distinctive Features and their Correlates. Cambridge, MA: MIT Press. Jespersen, O. 1904. Phonetische Grundfragen. Leipzig: Teubner. 1920. Lehrbuch der Phonetik. Leipzig: Teubner. Johnson, C. D. 1972. Formal Aspects of Phonological Description. The Hague: Mouton. Jones, D. 1909. Intonation Curves. Leipzig: Teubner. 1940. An Outline of English Phonetics. Cambridge: Heffer. Joos, M. 1957. Readings in Linguistics 1. Chicago: University of Chicago Press. Joseph, B. D. and I. Warburton. 1987. Modern Greek. London: Croom Helm. Kahn, D. 1976. Syllable Based Generalizations in English Phonology. Bloomington,
IULC. Kaisse, E. M. 1985. Connected Speech: the Interaction of Syntax and Phonology. New York: Academic Press. Kawasaki, H. 1982. An acoustical basis for universal constraints on sound sequences. Doctoral dissertation, University of California, Berkeley. 1986. Phonetic explanation for phonological universals: the case of distinctive vowel nasalization. In J. J. Ohala and J. J. Jaeger (eds.), Experimental Phonology. Orlando, FL: Academic Press. Kay, B. A., K. G. Munhall, E. Vatikiotis-Bateson, and J. A. S. Kelso. 1985. A note on processing kinematic data: sampling, filtering and differentiation. Hashins Laboratories Status Report on Speech Research SR-81: 291-303. Kaye, J. 1988. The ultimate phonological units - features or elements? Handout for paper delivered to LAGB, Durham Spring 1988. Kaye, J., J. Lowenstamm and J.-R. Vergnaud. 1985. The internal structure of phonological elements: a theory of charm and government. PY 2: 305-28. Keating, P. A. 1983. Comments on the jaw and syllable structure. JPhon 11: 401-6. 1985. Universal phonetics and the organization of grammars. In V. Fromkin (ed.), Phonetic Linguistics: Essays in Honor of Peter Ladefoged. Orlando, FL: Academic Press. 1988a. Underspecification in phonetics. Phonology 5: 275-92. 1988b. The phonology-phonetics interface. In F. Newmeyer (ed.), Cambridge 435
References Linguistic Survey, vol. 1: Linguistic Theory: Foundations. Cambridge: Cambridge University Press. Kelly, J. and J. Local. 1986. Long domain resonance patterns in English. In International Conference on Speech Input/Output; Techniques and Applications. IEE Conference Publication 258: 304-9. 1989. Doing Phonology: Observing, Recording, Interpreting. Manchester: Manchester University Press. Kelso, J. A. S. and B. Tuller. 1984. A dynamical basis for action systems. In M. Gazzaniga (ed.), Handbook of Cognitive Neuroscience. New York: Plenum, 321-56. Kelso, J. A. S., E. Saltzman and B. Tuller. 1986a. The dynamic perspective on speech production: data and theory. JPhon 14: 29-59. 1986b. Intentional contents, communicative context, and task dynamics: a reply to the commentators. JPhon 14: 171-96. Kelso, J. A. S., K. G. Holt, P. N. Kugler, and M. T. Turvey. 1980. On the concept of coordinative structures as dissipative structures, II: Empirical lines of convergence. In G. E. Stelmach and J. Requin (eds.), Tutorials in Motor Behavior. Amsterdam: North-Holland, 49-70. Kelso, J. A. S., B. Tuller, E. Vatikiotis-Bateson, and C. A. Fowler. 1984. Functionally specific articulatory cooperation following jaw perturbations during speech: evidence for coordinative structures. Journal of Experimental Psychology: Human Perception and Performance 10: 812-32. Kelso, J. A. S., E. Vatikiotis-Bateson, E. L. Saltzman, and B. Kay. 1985. A qualitative dynamic analysis of reiterant speech production: phase portraits, kinematics, and dynamic modeling. JASA 11: 266-80. Kenstowicz, M. 1970. On the notation of vowel length in Lithuanian. Papers in Linguistics 3: 73-113. Kent, R. D. 1983. The segmental organization of speech. In P. F. MacNeilage (ed.), The Production of Speech. New York: Springer. Kerswill, P. 1985. A socio-phonetic study of connected speech processes in Cambridge English: an outline and some results. Cambridge Papers in Phonetics and Experimental Linguistics. 4. 1987. Levels of linguistic variation in Durham. JL 23: 2 5 ^ 9 . Kerswill, P. and S. Wright. 1989. On the limits of auditory transcription: a sociophonetic approach. York Papers in Linguistics 14: 35-59. Kewley-Port, D. 1982. Measurement of formant transitions in naturally produced stop consonant-vowel syllables. JASA 72.2: 379-81. King, M. 1983. Transformational parsing. In M. King (ed.), Natural Language Parsing. London: Academic Press. Kingston, J. 1990. Articulatory binding. In J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech. Cambridge: Cambridge University Press, 406-34. Kingston, J. and M. E. Beckman (eds.). 1990. Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech. Cambridge: Cambridge University Press. 436
References Kingston, J. and R. Diehl. forthcoming. Phonetic knowledge and explanation. Ms., University of Massachusetts, Amherst, and University of Texas, Austin. Kiparsky, P. 1979. Metrical structure assignment is cyclic. LI 10: 421^2. 1985. Some consequences of Lexical Phonology. PY 2: 85-138. Kiparsky, P. and C. Kiparsky. 1967. Fact. In M. Bierwisch and K. E. Heidolph (eds.), Progress in Linguistics. The Hague: Mouton. Klatt, D. H. 1976. Linguistic uses of segmental duration in English: acoustic and perceptual evidence. JASA 59: 1208-21. 1980. Software for a cascade/parallel formant synthesizer. JASA 63.3: 971-95. Klein, E. 1987. Towards a declarative phonology. Ms., University of Edinburgh. Kohler, K. J. 1976. Die Instability wortfinaler Alveolarplosive im Deutschen: eine elektropalatographische Untersuchung. Phonetica 33: 1-30. 1979a. Kommunikative Aspekte satzphonetischer Prozesse im Deutschen. In H. Vater (ed.), Phonologische Probleme des Deutschen. Tubingen: Gunther Narr, 13-40. 1979b. Dimensions in the perception of fortis and lenis plosives. Phonetica 36: 332-43. 1990. Segmental reduction in connected speech in German: phonological facts and phonetic explanations. In W. J. Hardcastle and A. Marchal (eds.), Speech Production and Speech Modelling. Dordrecht: Kluwer, 62-92. Kohler, K. J., W. A. van Dommelen, and G. Timmermann. 1981. Die Merkmalpaare stimmhaft/stimmlos und fortis/lenis in der Konsonantenproduktion und -perzeption des heutigen Standard-Franzosisch. Institut fur Phonetik, Universitat Kiel. Arbeitsberichte, 14. Koutsoudas, A., G. Sanders and C. Noll. 1974. On the application of phonological rules. Lg 50: 1-28. Kozhevnikov, V. A. and L. A. Chistovich. 1965. Speech Articulation and Perception (Joint Publications Research Service, 30). Washington, DC. Krakow, R. A. 1989. The articulatory organization of syllables: a kinematic analysis of labial and velar gestures. Doctoral dissertation, Yale University. Krull, D. 1987. Second formant locus patterns as a measure of consonant-vowel coarticulation. PERILUS V. Institute of Linguistics, University of Stockholm. 1989. Consonant-vowel coarticulation in continuous speech and in reference words. STL-QPSR 1: 101-5. Kruyt, J. G. 1985. Accents from speakers to listeners. An experimental study of the production and perception of accent patterns in Dutch. Doctoral dissertation, University of Leiden. Kubozono, H. 1985. On the syntax and prosody of Japanese compounds. Work in Progress 18: 60-85. Department of Linguistics, University of Edinburgh. 1988a. The organisation of Japanese prosody. Doctoral dissertation, University of Edinburgh. 1988b. Constraints on phonological compound formation. English Linguistics 5: 150-69. 1988c. Dynamics of Japanese intonation. Ms., Nanzan University. 1989. Syntactic and rhythmic effects on downstep in Japanese. Phonology 6.1: 39-67. 437
References Kucera, H. and W. N. Francis. 1967. Computational Analysis of Present-day American English. Providence RI: Brown University Press. Kuno, S. 1973. The Structure of the Japanese Language. Cambridge, MA: MIT Press. Kutik, E., W. E. Cooper, and S. Boyce. 1983. Declination of fundamental frequency in speakers' production of parenthetical and main clauses. JASA 73: 1731-8. Ladd, D. R. 1980. The Structure of Intonational Meaning: Evidence from English. Bloomington: Indiana University Press. 1983a. Phonological features of intonational peaks. Lg 59: 721-59. 1983b. Levels versus configurations, revisited. In F. B. Agard, G. B. Kelley, A. Makkai, and V. B. Makkai, (eds.), Essays in Honor of Charles F. Hockett. Leiden: E. J. Brill. 1984. Declination: a review and some hypotheses. PY 1: 53-74. 1986a. Intonational phrasing: the case for recursive prosodic structure. PY 3: 311-40. 1986b. The representation of European downstep. Paper presented at the autumn meeting of the LAGB, Edinburgh. 1987a. Description of research on the procedures for assigning Fo to utterances. CSTR Text-to-Speech Status Report. Edinburgh: Centre for Speech Technology Research. 1987b. A phonological model of intonation for use in speech synthesis by rule. Proceedings of the European Conference on Speech Technology, Edinburgh.
1988. Declination "reset" and the hierarchical organization of utterances. JASA 84: 530-44. 1990. Metrical representation of pitch register. In J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics
of Speech. Cambridge: Cambridge University Press, 35-57. Ladd, D. R. and K. Silverman. 1984. Vowel intrinsic pitch in connected speech. Phonetica 41: 41-50. Ladd, D. R., K. Silverman, F. Tolkmitt, G. Bergmann, and K. R. Scherer. 1985. Evidence for the independent function of intonation contour, pitch range, and voice quality. JASA 78: 435-44. Ladefoged, P. 1971. Preliminaries to Linguistic Phonetics. Chicago: University of Chicago Press. 1977. The abyss between phonetics and phonology. Chicago Linguistic Society 13: 225-35. 1980. What are linguistic sounds made of? Lg 56: 485-502. 1982. A Course in Phonetics, 2nd edn. New York: Harcourt Brace Jovanovich. Ladefoged, P. and N. Antonanzas-Baroso. 1985. Computer measures of breathy voice quality. UCLA Working Papers in Phonetics 61: 79-86. Ladefoged, P. and M. Halle. 1988. Some major features of the International Phonetic Alphabet. Lg 64: 577-82. Ladefoged, P. and M. Lindau. 1989. Modeling articulatory-acoustics relations: a comment on Stevens' "On the quantal nature of speech." JPhon 17: 99-106. 438
References Ladefoged, P. and I. Maddieson. 1989. Multiply articulated segments and the feature hierarchy. UCLA Working Papers in Phonetics 72: 116-38. Lahiri, A. and J. Hankamer. 1988. The timing of geminate consonants. JPhon 16: 327-38. Lahiri, A. and J. Koreman. 1988. Syllable weight and quantity in Dutch. WCCFL 7: 217-28. Lakoff, R. 1973. Language and woman's place. Language in Society 2: 45-79. Langmeier, C , U. Luders, L. Schiefer, and B. Modi. 1987. An acoustic study on murmured and "tight" phonation in Gujarati dialects - a preliminary report. Proceedings of the Eleventh International Congress of Phonetic Sciences 1: 328-31. Lapointe, S. G. 1977. Recursiveness and deletion. Linguistic Analysis 3.3: 227-65. Lashley, K. S. 1930. Basic neural mechanisms in behavior. Psychological Review 37: 1-24. Lass, R. 1976. English Phonology and Phonological Theory: Synchronic and Diachronic Studies. Cambridge: Cambridge University Press. 1984a. Vowel system universals and typology: prologue to theory. PY 1: 75-112. 1984b. Phonology: an Introduction to Basic Concepts. Cambridge: Cambridge University Press. Lea, W. A. 1980. Prosodic aids to speech recognition. In W. A. Lea (ed.), Trends in Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall. Leben, W. R. 1973. Suprasegmental phonology. Doctoral dissertation, MIT. 1976. The tones in English intonation. Linguistic Analysis 2: 69-107. 1978. The representation of tone. In V. Fromkin (ed.), Tone: a Linguistic Survey. New York: Academic Press. Lehiste, I. 1970. Suprasegmentals. Cambridge, MA: MIT Press. 1972. The timing of utterances and linguistic boundaries. JASA 51: 2018-24. 1975. The phonetic structure of paragraphs. In A. Cohen and S. G. Nooteboom (eds.), Structure and Process in Speech Perception. Heidelberg: Springer. 1980. Phonetic manifestation of syntactic structure in English. Annual Bulletin, University of Tokyo RILP 14: 1-28. Liberman, A. M. and I. G. Mattingly. 1985. The motor theory of speech perception revised. Cognition 21: 1-36. Liberman, M. Y. 1975. The Intonational System of English. Doctoral dissertation, MIT. Distributed in 1978 by IULC. Liberman, M. Y. and J. Pierrehumbert. 1984. Intonational invariance under changes in pitch range and length. In M. Aronoff and R. T. Oehrle (eds.), Language Sound Structure: Studies in Phonology Presented to Morris Halle. Cambridge, MA: MIT Press, 157-233. Liberman, M. Y. and A. Prince. 1977. On stress and linguistic rhythm. LI 8: 249336. Licklider, J. C. R. and G. A. Miller. 1951. The perception of speech. In S. S. Stevens (ed.), Handbook of Experimental Psychology. New York: John Wiley. Lieberman, P. 1967. Intonation, Perception, and Language. Cambridge, MA: MIT Press. 439
References Lindau, M. and P. Ladefoged. 1986. Variability of feature specifications. In J. S. Perkell and D. Klatt (eds.), Invariance and Variability of Speech Processes. Hillsdale, NJ: Lawrence Erlbaum. Lindblom, B. 1963. Spectrographic study of vowel reduction. JASA 35: 1773-81. 1983. Economy of speech gestures. In P. MacNeilage (ed.), The Production of Speech. New York: Springer. 1984. Can the models of evolutionary biology be applied to phonetic problems? In M. P. R. van den Broeke and A. Cohen (eds.), Proceedings of the Ninth International Congress of Phonetic Sciences, Dordrecht: Foris. 1989. Phonetic invariance and the adaptive nature of speech. In B. A. G. Elsdendoorn and H. Bouma (eds.), Working Models of Human Perception. London: Academic Press. Lindblom, B. and R. Lindgren. 1985. Speaker-listener interaction and phonetic variation. PERILUS IV, Institute of Linguistics, University of Stockholm. Lindblom, B. and J. Lubker. 1985. The speech homunculus and a problem of phonetic linguistics. In V. Fromkin (ed.), Essays in Honor of Peter Ladefoged. Orlando, FL: Academic Press. Lindsey, G. 1985. Intonation and interrogation: tonal structure and the expression of a pragmatic function in English and other languages. Doctoral dissertation, UCLA. Linell, P. 1979. Psychological Reality in Phonology. Cambridge: Cambridge University Press. Local, J. K. 1990. Some rhythm, resonance and quality variations in urban Tyneside speech. In S. Ramsaran (ed.), Studies in the Pronunciation of English: a Commemorative Volume in Honour of A. C. Gimson. London: Routledge, 282-92. Local, J. K. and J. Kelly. 1986. Projection and "silences": notes on phonetic detail and conversational structure. Human Studies 9: 185-204. Lodge, K. R. 1984. Studies in the Phonology of Colloquial English. London: Croom Helm. Lofqvist, A., T. Baer, N. S. McGarr, and R. S. Story. 1989. The cricothyroid muscle in voicing control. JASA 85: 1314-21. Lubker, J. 1968. An EMG-cinefluorographic investigation of velar function during normal speech production. Cleft Palate Journal 5.1. 1981. Temporal aspects of speech production: anticipatory labial coarticulation. Phonetica 38: 51-65. Lyons, J. 1962. Phonemic and non-phonemic phonology. International Journal of American Linguistics, 28: 127-33. McCarthy, J. J. 1979. Formal problems in Semitic phonology and morphology. Doctoral dissertation, MIT. 1981. A prosodic theory of nonconcatenative morphology. LI 12.3: 373-418. 1989. Feature geometry and dependency. Phonetica 43: 84-108. McCarthy, J. J. and A. Prince. 1986. Prosodic Morphonology. Manuscript to appear with MIT Press. McCawley, J. D. 1968. The Phonological Component of a Grammar of Japanese. The Hague: Mouton. 440
References Macchi, M. 1985. Segmental and suprasegmental features and lip and jaw articulators. Doctoral dissertation, New York University. 1988. Labial articulation patterns associated with segmental features and syllables in English. Phonetica 45: 109-21. McClelland, J. L. and J. L. Elman. 1986. The TRACE model of speech perception. Cognitive Psychology 18: 1-86. McCroskey, R. L., Jr. 1957. Effect of speech on metabolism. Journal of Speech and Hearing Disorders 22: 46-52. MacKay, D. G. 1972. The structure of words and syllables: evidence from errors in speech. Cognitive Psychology 3: 210-27. Maddieson, I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. Maeda, S. 1974. A characterization of fundamental frequency contours of speech. MIT Quarterly Progress Report 114: 193-211. 1976. A characterization of American English intonation. Doctoral dissertation, MIT. Magen, H. 1989. An acoustic study of vowel-to-vowel coarticulation in English. Doctoral dissertation, Yale University. Makkai, V. B. 1972. Phonological Theory: Evolution and Current Practice. New York: Holt, Rinehart, and Winston. Malikouti-Drachman, A. and B. Drachman. 1980. Slogan chanting and speech rhythm in Greek. In W. Dressier, O. Pfeiffer, and J. Rennison (eds.), Phonologica 1980. Innsbruck: Innsbrucker Beitrage zur Sprachwissenschaft. Mandelbrot, B. 1954. Structure formelle des textes et communication. Word 10: 1-27. Marslen-Wilson, W. D. 1984. Function and process in spoken word-recognition. In H. Bouma and D. G. Bouwhuis (eds.), Attention and Performance X: Control of Language Processes. Hillsdale, NJ: Lawrence Erlbaum. 1987. Functional parallelism in spoken word-recognition. In U. Frauenfelder and L. K. Tyler (eds.), Spoken Word Recognition. Cambridge, MA: MIT Press. Mascaro, J. 1983. Phonological levels and assimilatory processes. Ms., Universitat Autonoma de Barcelona. Massaro, D. W. 1972. Preperceptual images, processing time and perceptual units in auditory perception. Psychological Review 79: 124-45. Mehler, J. 1981. The role of syllables in speech processing. Philosophical Transactions of the Royal Society B295: 333-52. Mehler, J., J. Y. Dommergues, U. Frauenfelder, and J. Segui. 1981. The syllable's role in speech segmentation. JVLVB 20: 298-305. Mehrota, R. C. 1980. Hindi Phonology. Raipur. Menn, L. and S. Boyce. 1982. Fundamental frequency and discourse structure. Lg & Sp 25: 341-83. Menzerath, P. and A. de Lacerda. 1933. Koartikulation, Steuerung und Lautabgrenzung. Bonn. Miller, J. E. and O. Fujimura. 1982. Graphic displays of combined presentations of acoustic and articulatory information. The Bell System Technical Journal 61: 799-810. Mills, C. B. 1980. Effects of context on reaction time to phonemes. JVLVB 19: 75-83. 441
Mohanan, K. P. 1983. The structure of the melody. Ms., MIT and University of Singapore. 1986. The Theory of Lexical Phonology. Dordrecht: Reidel.
Monsen, R. B., A. M. Engebretson, and N. R. Vermula. 1978. Indirect assessment of the contribution of sub-glottal pressure and vocal fold tension to changes of fundamental frequency in English. JASA 64: 65-80. Munhall, K., D. Ostry, and A. Parush. 1985. Characteristics of velocity profiles of speech movements. Journal of Experimental Psychology: Human Perception and Performance 2: 457-74.
Nakatani, L. and J. Schaffer. 1978. Hearing "words" without words: prosodic cues for word perception. JASA 63: 234-45. Nathan, G. S. 1983. The case for place - English rapid speech autosegmentally. Chicago Linguistic Society 19: 309-16.
Nearey, T. M. 1980. On the physical interpretation of vowel quality: cinefluorographic and acoustic evidence. JPhon 8: 213-41. Nelson, W. 1983. Physical principles for economies of speech movements. Biological Cybernetics 46: 135-47. Nespor, M. 1988. Rithmika charaktiristika tis Ellinikis (Rhythmic Characteristics of Greek). Studies in Greek Linguistics: Proceedings of the 9th Annual Meeting of the Department of Linguistics, Faculty of Philosophy, Aristotelian University of Thessaloniki. Nespor, M. and I. Vogel. 1982. Prosodic domains of external sandhi rules. In H. van der Hulst and N. Smith (eds.) Advances in Non-Linear Phonology. Dordrecht: Foris. 1983. Prosodic structure above the word. In A. Cutler and D. R. Ladd (eds.), Prosody: Models and Measurements. Heidelberg: Springer.
1986. Prosodic Phonology. Dordrecht: Foris. 1989. On clashes and lapses. Phonology 6: 69-116. Nittrouer, S., M. Studdert-Kennedy and R. S. McGowan. 1989. The emergence of phonetic segments: evidence from the spectral structure of fricative-vowel syllables spoken by children and adults. JSHR 32: 120-32. Nittrouer, S., K. Munhall, J. A. S. Kelso, E. Tuller, and K. S. Harris. 1988. Patterns of interarticulator phasing and their relationship to linguistic structure. JASA 84: 1653-61. Nolan, F. J. 1986. The implications of partial assimilation and incomplete neutralisation. Cambridge Papers in Phonetics and Experimental Linguistics, 5.
Norris, D. G. and A. Cutler. 1988. The relative accessibility of phonemes and syllables. Perception and Psychophysics 45: 485-93.
O'Connor, J. D. and G. F. Arnold. 1973. Intonation of Colloquial English, 2nd edn. London: Longman. Ohala, J. J. 1974. Experimental historical phonology. In J. M. Anderson and C. Jones (eds.), Historical Linguistics, vol. II: Theory and Description in Phonology.
North-Holland, Amsterdam, 353-89. 1975. Phonetic explanations for nasal sound patterns. In C. A. Ferguson, L. M.
References Hyman, and J. J. Ohala (eds.), Nasalfest: Papers from a Symposium on Nasals and Nasalization. Stanford: Language Universals Project. 1976. A model of speech aerodynamics. Report of the Phonology Laboratory (Berkeley) 1: 93-107. 1978. Phonological notations as models. In W. U. Dressier and W. Meid (eds.), Proceedings of the Twelfth International Congress of Linguists, Vienna 1977. Innsbruck: Innsbrucker Beitrage zur Sprachwissenschaft. 1979a. Universals of labial velars and de Saussure's chess analogy. Proceedings of the Ninth International Congress of Phonetic Sciences, vol. II. Copenhagen: Institute of Phonetics. 1979b. The contribution of acoustic phonetics to phonology. In B. Lindblom and S. Ohman (eds.), Frontiers of Speech Communication Research. London: Academic Press. 1981a. Speech timing as a tool in phonology. Phonetica 43: 84-108. 1981b. The listener as a source of sound change. In C. S. Masek, R. A. Hendrick, and M. F. Miller (eds.), Papers from the Parasession on Language and Behavior. Chicago: Chicago Linguistic Society. 1982. Physiological mechanisms underlying tone and intonation. In H. Fujisaki and E. Garding (eds.), Preprints, Working Group on Intonation, Thirteenth International Congress of Linguists, Tokyo, 29 Aug.-4 Sept. 1982. Tokyo. 1983. The origin of sound patterns in vocal tract constraints. In P. F. MacNeilage (ed.), The Production of Speech. New York: Springer. 1985a. Linguistics and automatic speech processing. In R. De Mori and C.-Y. Suen (eds.), New Systems and Architectures for Automatic Speech Recognition and Synthesis. Berlin: Springer. 1985b. Around flat. In V. Fromkin (ed.), Phonetic Linguistics. Essays in Honor of Peter Ladefoged. Orlando, FL: Academic Press. 1986. Phonological evidence for top down processing in speech perception. In J. S. Perkell and D. H. Klatt (eds.), Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence Erlbaum. 1987. Explanation in phonology: opinions and examples. In W. U. Dressier, H. C. Luschiitzky, O. E. Pfeiffer, and J. R. Rennison (eds.), Phonologica 1984. Cambridge: Cambridge University Press. 1989. Sound change is drawn from a pool of synchronic variation. In L. E. Breivik and E. H. Jahr (eds.), Language Change: Do we Know its Causes Yet? (Trends in Linguistics). Berlin: Mouton de Gruyter. 1990a. The phonetics and phonology of aspects of assimilation. In J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I. Between the Grammar and Physics of Speech. Cambridge: Cambridge University Press, 258-75. 1990b. The generality of articulatory binding: comments on Kingston's "Articulatory binding". In J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I. Between the Grammar and Physics of Speech. Cambridge: Cambridge University Press, 445-50. forthcoming. The costs and benefits of phonological analysis. In P. Downing, 443
References S. Lima, and M. Noonan (eds.), Literacy and Linguistics. Amsterdam: John Benjamins. Ohala, J. J. and B. W. Eukel. 1987. Explaining the intrinsic pitch of vowels. In R. Channon and L. Shockey, (eds.), In Honor of Use Lehiste. Use Lehiste Puhendusteos. Dordrecht: Foris, 207-15. Ohala, J. J. and D. Feder. 1987. Listeners' identity of speech sounds is influenced by adjacent "restored" phonemes. Proceedings of the Eleventh International Congress of Phonetic Sciences. 4: 120-3. Ohala, J. J. and J. J. Jaeger (eds.), 1986. Experimental Phonology. Orlando, FL: Academic Press. Ohala, J. J. and H. Kawasaki. 1984. Phonetics and prosodic phonology. PY \\ 113-27. Ohala, J. J. and J. Lorentz. 1977. The story of [w]: an exercise in the phonetic explanation for sound patterns. Berkeley Linguistic Society, Proceedings, Annual Meeting 3: 577-99. Ohala, J. J. and C. J. Riordan. 1980. Passive vocal tract enlargement during voiced stops. Report of the Phonology Laboratory, Berkeley 5: 78-87. Ohala, J. J., M. Amador, L. Araujo, S. Pearson, and M. Peet. 1984. Use of synthetic speech parameters to estimate success of word recognition. JASA 75: S.93. Ohala, M. 1979. Phonological features of Hindi stops. South Asian Languages Analysis 1: 79-87. 1983. Aspects of Hindi Phonology. Delhi: Motilal Banarsidass. Ohman, S. E. G. 1966a. Coarticulation in VCV utterances: spectrographic measurements. JASA 39: 151-68. 1966b. Perception of segments of VCCV utterances. JASA 40: 979-88. Olive, J. 1975. Fundamental frequency rules for the synthesis of simple declarative sentences. JASA 57: 476-82. Oiler, D. K. 1973. The effect of position in utterance on speech segment duration in English. JASA 54: 1235^47. O'Shaughnessy, D. 1979. Linguistic features in fundamental frequency patterns. JPhonl: W9-A5. O'Shaughnessy, D. and J. Allen. 1983. Linguistic modality effects on fundamental frequency in speech. JASA 74, 1155-71. Ostry, D. J. and K. G. Munhall. 1985. Control of rate and duration of speech movements. JASA 11: 640-8. Ostry, D. J., E. Keller, and A. Parush. 1983. Similarities in the control of speech articulators and the limbs: kinematics of tongue dorsum movement in speech. Journal of Experimental Psychology: Human Perception and Performance 9: 622-36. Otsu, Y. 1980. Some aspects of rendaku in Japanese and related problems. Theoretical Issues in Japanese Linguistics (MIT Working Papers in Linguistics 2). Palmer, F. R. 1970. Prosodic Analysis. Oxford: Oxford University Press. Perkell, J. S. 1969. Physiology of Speech Production: Results and implications of a
References quantitative cineradiographic study (Research Monograph 53). Cambridge, MA: MIT Press. Peters, S. 1973. On restricting deletion transformations. In M. Gross, M. H. Halle, and M. Schutzenberger (eds.), The Formal Analysis of Language. The Hague: Mouton. Peters, S. and R. W. Ritchie 1972. On the generative power of transformational grammars. Information Sciences 6: 49-83. Peterson, G. E. and H. Barney. 1952. Control methods used in a study of vowels. JASA 24: 175-84. Peterson, G. E. and I. Lehiste. 1960. Duration of syllable nuclei in English. JASA 32: 693-703. Pierrehumbert, J. 1980. The Phonetics and Phonology of English Intonation. Doctoral dissertation, MIT; distributed 1988, Bloomington: IULC. 1981. Synthesizing intonation. JASA 70: 985-95. forthcoming. A preliminary study of the consequences of intonation for the voice source. STL-QPSR 4: 23-36. Pierrehumbert, J. and M. E. Beckman. 1988. Japanese Tone Structure (Linguistic Inquiry Monograph Series 15). Cambridge, MA: MIT Press. Pierrehumbert, J. and J. Hirschberg. 1990. The meaning of intonation contours in the interpretation of discourse. In P. Cohen, J. Morgan, and M. Pollack (eds.), Plans and Intentions in Communication. Cambridge, MA: MIT Press, 271-312. Pike, K. L. 1945. The Intonation of American English. Ann Arbor: University of Michigan Press. Pollard, C. and I. Sag. 1987. Information-based Syntax and Semantics. Stanford: CSLI. Poon, P. G. and C. A. Mateer. 1985. A study of Nepali stop consonants. Phonetica 42: 39^7. Poser, W. J. 1984. The phonetics and phonology of tone and intonation in Japanese. Doctoral dissertation, MIT. Pulleyblank, D. 1986. Tone in Lexical Phonology. Dordrecht: Reidel. 1989. Non-linear phonology. Annual Review of Anthropology 18: 203-26. Pullum, G. K. 1978. Rule Interaction and the Organization of a Grammar. New York: Garland. Recasens, D. 1987. An acoustic analysis of V-to-C and V-to-V coarticulatory effects in Catalan and Spanish VCV sequences. JPhon 15: 299-312. Repp, B. R. 1981. On levels of description in speech research. JASA 69.5: 1462-4. 1986. Some observations on the development of anticipatory coarticulation. JASA 79: 1616-19. Rialland, A. and M. B. Badjime. 1989. Reanalyse des tons du Bambara: des tons du nom a Torganisation generate du systeme. Studies in African Linguistics 20.1: 1-28. Rietveld, A. C. M. and C. Gussenhoven. 1985. On the relation between pitch excursion size and pitch prominence. JPhon 13: 299-308.
Riordan, C. 1977. Control of vocal tract length in speech. JASA 62: 998-1002. Roach, P. 1983. English Phonetics and Phonology: a Practical Course. Cambridge: Cambridge University Press. Roca, I. 1986. Secondary stress and metrical rhythm. PY 3: 341-70. Roudet, L. 1910. Elements de phonetique generale. Paris. Rubin, P., T. Baer, and P. Mermelstein. 1981. An articulatory synthesizer for perceptual research. JASA 70: 321-8. Sagart, L., P. Halle, B. de Boysson-Bardies, and C. Arabia-Guidet. 1986. Tone production in modern standard Chinese: an electromyographic investigation. Paper presented at nineteenth International Conference on Sino-Tibetan Languages and Linguistics, Columbus, OH, 12-14 September 1986. Sagey, E. 1986a. The representation of features and relations in non-linear phonology. Doctoral dissertation, MIT. 1986b. On the representation of complex segments and their formulation in Kinyarwanda. In E. Sezer and L. Wetzels (eds.), Studies in Compensatory Lengthening. Dordrecht: Foris. Salasoo, A. and D. Pisoni. 1985. Interaction of knowledge sources in spoken word
References Sereno, J. A., S. R. Baum, G. C. Marean, and P. Lieberman. 1987. Acoustic analysis and perceptual data on anticipatory labial coarticulation in adults and children. JASA 81: 512-19. Setatos, M. 1974. Phonologia tis Kinis Neoellinikis {Phonology of Standard Greek). Athens: Papazisis. Sharf, D. J. and R. N. Ohde. 1981. Physiologic, acoustic and perceptual aspects of coarticulation: implications for the remediation of articulatory disorders. In N. J. Lass (ed.), Speech and Language: Advances in Basic Research and Practice, vol. 5. New York: Academic Press, 153-247. Shattuck-Hufnagel, S. and D. H. Klatt. 1979. Minimal use of features and markedness in speech production: evidence from speech errors. JVLVB 18: 41-55. Shattuck-Hufnagel, S., V. W. Zue, and J. Bernstein. 1978. An acoustic study of palatalization of fricatives in American English. JASA 64: S92(A). Shieber, S. M. 1986. An Introduction to Unification-based Approaches to Grammar. Stanford: CSLI. Sievers, E. 1901. Grundzuge der Phonetik. Leipzig: Breitkopf and Hartel. Silverman, K. E. A. and J. Pierrehumbert. 1990. The timing of prenuclear high accents in English. In J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I. Between the Grammar and Physics of Speech. Cambridge: Cambridge University Press, 72-106. Simada, Z. and H. Hirose. 1971. Physiological correlates of Japanese accent patterns. Annual Bulletin of the Research Institute of Logopedics and Phoniatrics 5: 41-9. Soundararaj, F. 1986. Acoustic phonetic correlates of prominence in Tamil words. Work in Progress 19: 16-35. Department of Linguistics, University of Edinburgh. Sprigg, R. K. 1963. Prosodic analysis and phonological formulae, in Tibeto-Burman linguistic comparison. In H. L. Shorto (ed.), Linguistic Comparison in South East Asia and the Pacific. London: School of Oriental and African Studies. 1972. A polysystemic approach, in proto-Tibetan reconstruction, to tone and syllable initial consonant clusters. Bulletin of the School of Oriental and African Studies, 35. 3: 546-87. Steele, S. 1986. Interaction of vowel Fo and prosody. Phonetica 43: 92-105. 1987. Nuclear accent Fo peak location: effects of rate, vowel, and number of following syllables. JASA 80: Suppl. 1, S51. Steele, S. and M. Y. Liberman. 1987. The shape and alignment of rising intonation. JASA 81: S52. Steriade, D. 1982. Greek prosodies and the nature of syllabification. Doctoral dissertation, MIT. Stetson, R. 1951. Motor Phonetics: a Study of Speech Movements in action. Amsterdam: North-Holland. Stevens, K. N. 1972. The quantal nature of speech: evidence from articulatoryacoustic data. In E. E. David, Jr. and P. B. Denes (eds.), Human Communication: a Unified View. New York: McGraw-Hill. 1989. On the quantal nature of speech. JPhon 17: 3^5. 448
References Stevens, K. N. and S. J. Keyser 1989. Primary features and their enhancement in consonants. Lg 65: 81-106. Stevens, K. N., S. J. Keyser, and H. Kawasaki. 1986. Toward a phonetic and phonological theory of redundant features. In J. S. Perkell and D. H. Klatt (eds.), Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence Erlbaum. Strange, W., R. R. Verbrugge, D. P. Shankweiler, and T. R. Edman. 1976. Consonant environment specifies vowel identity. JASA 60: 213-24. Sugito, M. and H. Hirose. 1978. An electromyographic study of the Kinki accent. Annual Bulletin of the Research Institute of Logopedics and Phoniatrics 12: 35-51. Summers, W. V. 1987. Effects of stress and final consonant voicing on vowel production: articulatory and acoustic analyses. JASA 82: 847-63. Summers, W. V., D. B. Pisoni, R. H. Bernacki, R. I. Pedlow, and M. A. Stokes. 1988. Effects of noise on speech production: acoustic and perceptual analyses. JASA 84: 917-28. Sussman, H., P. MacNeilage, and R. Hanson. 1973. Labial and mandibular dynamics during the production of bilabial consonants: preliminary observations. JSHR 17: 397-420. Sweet, H. 1877. A Handbook of Phonetics. Oxford: Oxford University Press. Swinney, D. and P. Prather. 1980. Phoneme identification in a phoneme monitoring experiment: the variable role of uncertainty about vowel contexts. Perception and Psychophysics 27: 104-10. Taft, M. 1978. Evidence that auditory word perceptions is not continuous: The DAZE effect. Paper presented at the fifth Australian Experimental Psychology Conference, La Trobe University. 1979. Recognition of affixed words and the word frequency effect. Memory and Cognition 7: 263-72. Talkin, D. 1989. Voicing epoch determination with dynamic programming. JASA 85: S149(A). Thorsen, N. 1978. An acoustical analysis of Danish intonation. JPhon 6: 151-75. 1979. Interpreting raw fundamental frequency tracings of Danish. Phonetica 36: 57-78. 1980a. A study of the perception of sentence intonation: evidence from Danish. JASA 67: 1014-30. 1980b. Intonation contours and stress group patterns in declarative sentences of varying length in ASC Danish. ARIPUC 14: 1-29. 1981. Intonation contours and stress group patterns in declarative sentences of varying length in ASC Danish - supplementary data. ARIPUC 15: 13-47. 1983. Standard Danish sentence intonation - phonetic data and their representation. Folia Linguistica 17: 187-220. 1984a. Variability and invariance in Danish stress group patterns. Phonetica 41: 88-102. 1984b. Intonation and text in standard Danish with special reference to the abstract representation of intonation. In W. U. Dressier, H. C. Luschiitzky, O. E. 449
References Pfeiffer, and J. R. Rennison (eds.), Phonologica 1984. Cambridge: Cambridge University Press. 1985. Intonation and text in Standard Danish. JASA IT 1205-16. 1986. Sentence intonation in textual context: supplementary data. JASA 80: 1041-7. Touati, P. 1987. Structures prosodiques du Suedois et du Francais. Lund: Lund University Press. Thrainsson, H. 1978. On the phonology of Icelandic aspiration. Nordic Journal of Linguistics 1: 3-54. Trager, G. L. and H. L. Smith. 1951. An Outline of English Structure (Studies in Linguistics, Occasional Papers 3). Norman, OK: Battenburg Press. Trim, J. L. M. 1959. Major and minor tone-groups in English. Le Maitre Phonetique 111:26-9. Trubetzkoy, N. S. 1939. Grundzuge der Phonologic transl. Principles of Phonology. C. A. M. Baltaxe, 1969. Berkeley: University of California Press. Turnbaugh, K. R., P. R. Hoffman, R. G. Daniloff, and R. Absher. 1985. Stor>-vowel coarticulation in 3-year-old, 5-year-old, and adult speakers. JASA 77: 1256-7. Tyler, L. K. and J. Wessels. 1983. Quantifying contextual contributions to word recognition processes. Perception and Psychophysics 304: 409-20. Uldall, E. T. 1958. American "molar" r and "flapped" r. Revisto do Laboratorio de Fonetica Experimental (Coimbra) 4: 103—6. Uyeno, T., H. Hayashibe, K. Imai, H. Imagawa, and S. Kiritani. 1981. Syntactic structures and prosody in Japanese: a case study on pitch contours and the pauses at phrase boundaries. University of Tokyo, Research Institute of Logopaedics and Phoniatrics, Annual Bulletin 1: 91-108. Vaissiere, J. 1988. Prediction of velum movement from phonological specifications. Phonetica 45: 122-39. van der Hulst, H. and N. Smith. 1982. The Structure of Phonological Representations. Part I. Dordrecht: Foris. Vatikiotis-Bateson, E. 1988. Linguistic Structure and Articulatory Dynamics. Doctoral dissertation, Indiana University. Distributed by IULC. Vogten, L. 1985. LVS-manual. Speech processing programs on IPO-VAX 11/780. Eindhoven: Institute for Perception Research. Waibel, A. 1988. Prosody and Speech Recognition. London: Pitman; San Mateo: Morgan Kaufman. Wang, W. S-Y. and J. Crawford. 1960. Frequency studies of English consonants. Lg &Sp 3: 131-9. Warren, P. and W. D. Marslen-Wilson. 1987. Continuous uptake of acoustic cues in spoken word recognition. Perception and Psychophysics 43: 262-75. 1988. Cues to lexical choice: discriminating place and voice. Perception and Psychophysics 44: 21-30. Warren, R. M. 1970. Perceptual restoration of missing speech sounds. Science 167: 392-3. Wells, J. C. 1982. Accents of English 1: An Introduction. Cambridge: Cambridge University Press. 450
Westbury, J. R. 1983. Enlargement of the supraglottal cavity and its relation to stop consonant voicing. JASA 73: 1322-36. Whitney, W. 1879. Sanskrit Grammar. Cambridge, MA: Harvard University Press. Williams, B. 1985. Pitch and duration in Welsh stress perception: the implications for intonation. JPhon 13: 381-406. Williams, C. E. and K. N. Stevens. 1972. Emotions and speech: some acoustical correlates. JASA 52: 1238-50. Wood, S. 1982. X-ray and model studies of vowel articulation. Working Papers, Dept. of Linguistics, Lund University 23. Wright, J. T. 1986. The behavior of nasalized vowels in the perceptual vowel space. In Ohala, J. J. and J. J. Jaeger (eds.), 1986. Experimental Phonology. Orlando, FL: Academic Press. Wright, S. and P. Kerswill. 1988. On the perception of connected speech processes. Paper delivered to LAGB, Durham, Spring 1988. Yaeger, M. 1975. Vowel harmony in Montreal French. JASA 57: S69. Yip, M. 1989. Feature geometry and co-occurrence restrictions. Phonology 6: 349-74. Zimmer, K. E. 1969. Psychological correlates of some Turkish morpheme structure conditions. Lg 46: 309-21. Zsiga, E. and D. Byrd. 1988. Phasing in consonant clusters: articulatory and acoustic effects. Ms. Zue, V. W. and S. Shattuck-Hufnagel. 1980. Palatalization of /s/ in American English: when is a /š/ not a /š/? JASA 67: S27.
Name index
Abbs, J. H., 11, 122 Albrow, K. H., 194 Alfonso, P., 26 Ali, L. H., 255 Allen, J., 116, 328 Alstermark, M., 178 Anderson, L. B., 178 Anderson, S. R., 5, 27, 192, 200, 298 Antoñanzas-Barroso, N., 304 Archangeli, D., 154, 198, 234 Aristotle, 227 Arvaniti, A., 4, 398 Baer, T., 26-7 Barney, H., 169 Barry, M., 205, 264, 267-8, 278 Beckman, M. E., 2-4, 64, 68, 77, 84, 87-90, 92-4, 111, 114, 118, 120, 122-3, 125-6, 193, 196, 324, 326-7, 331, 333-4, 342-3, 345, 368-9, 372-3, 375, 385-7, 389-91, 393, 396 Beddor, P. S., 179, 181 Bell-Berti, F., 26, 141 Benguerel, A. P., 311 Berendsen, E., 399 van den Berg, R., 125-6, 326, 331, 334-5, 359-61, 363-4, 366-7, 379, 384-5, 388-91, 394 Bernstein, J., 210 Bernstein, N. A., 14 Bertsch, W. F., 168 Bever, T. G., 293 Bhatia, T. K., 311 Bickley, C., 304 Blank, M. A., 292-3 Bloomfield, L., 313 Blumstein, S. E., 136
Bolinger, D., 322, 325, 332 Botha, R. P., 191 Botinis, A., 399-403, 412, 414, 418 Boves, L., 5 Boyce, S., 64, 326 Broe, M., 4, 94, 149, 198 Bromberger, S., 196 Browman, C. P., 2-4, 9, 14, 16, 18, 20-7, 30, 42, 44, 56-67, 69-70, 87, 116, 122, 128, 136, 165, 190, 194, 199-200, 225, 257, 287-8, 314 Brown, R. W., 181 Bruce, G., 5, 90, 325 Bullock, D., 70 Byrd, D., 288 Carlson, L., 393 Carnochan, J., 196 Catford, J. C., 313 Chang, N.-C., 328 Chiba, T., 170, 175 Chistovich, L. A., 141 Chomsky, N., 116, 149, 165, 167, 183, 195-6, 216, 296-8, 398 Chowdhury, M., 235 Clements, G. N., 84, 150, 155, 158-9, 178, 183-7, 192, 198, 230, 262, 285, 343, 370, 381, 389 Cohen, A., 322, 331, 339 Cohen, J., 63 Cohen, P., 63 Coleman, J., 192 Collier, R., 322, 331 Cooper, A., 121 Cooper, W. E., 213, 333-5 Costa, P. J., 280 Crawford, J., 176
Crompton, A., 294 Crystal, D., 325 Cutler, A., 2, 5, 180, 213, 290, 293-5 Dalby, J. M., 203 Daniloff, R., 136, 141 Dauer, R. M., 399-400, 418 Delattre, P., 172 Dev, A. T., 239 Diehl, R., 65 Dinnsen, D. A., 274 Dixit, R. P., 300 Dobson, E. J., 169 Docherty, G. J., 279 van Dommelen, W. A., 142 Dowty, D. R., 194 Drachman, B., 399-402, 406, 409, 414, 416, 418 Dunn, M., 260 Edman, T. R., 173 Edwards, J., 2, 3, 68, 87-9, 92, 111, 114, 120, 122-3, 125-6 Engebretson, A. M., 395 Erikson, D., 178, 395 Fant, G., 93, 100, 118, 144, 170, 175, 199, 227, 279 Feder, D., 256 Ferguson, C. A., 235 Feth, L., 136 Firth, J. R., 142, 192, 225, 261 Flanagan, J. L., 144 Fletcher, J., 2-3, 5, 68, 87-9, 92, 111, 114, 120, 122-3, 125-6 Folkins, J. W., 11 Foss, D. J., 292-3 Fourakis, M., 279, 399-400, 418 Fowler, C. A., 14, 18, 26, 56, 69, 201, 225, 278 Francis, W. N., 239 Franke, F., 66 Frauenfelder, U., 293 Frederiksen, J. R., 256 Fromkin, V., 180, 292 Fry, D. B., 332 Fudge, E. C., 197 Fujimura, O., 2-3, 21, 31, 87, 89, 117-9, 174, 176, 369, 395 Fujisaki, H., 348, 368, 372, 379 Gamkrelidze, T. V., 176 Gårding, E., 331, 348, 366 Gay, T., 64, 68, 201 Gazdar, G., 216
Gernsbacher, M. A., 292 Gimson, A. C., 143, 203-5, 219 Gobl, C., 91, 95 Goldsmith, J., 150, 166, 184, 192 Goldstein, L., 2-3, 9, 14, 16, 18, 20-7, 30, 35, 42, 44, 56-67, 69-70, 87, 116, 120, 122, 128, 136, 165, 179, 181, 190, 194, 199-200, 225, 288, 314 Gracco, V., 122 Greene, P. H., 14 Grønnum, N. (formerly Thorsen), 328, 334, 346, 359, 365, 367, 388-9 Grosjean, F., 236 Grossberg, S., 70 Günther, H., 257 Gussenhoven, C., 125-6, 326, 331, 334-5, 338, 339, 341, 347, 359-61, 363-4, 366-7, 379, 384-5, 388-91, 394
Hakoda, K., 380 Halle, M., 116, 118, 149, 151-2, 159, 167, 183, 195-6, 199, 216, 227, 279, 296-8, 398 Hammarberg, R. E., 141 Han, M. S., 380 Hankamer, J., 249, 252, 278 Hanson, R., 122 Hardcastle, W. J., 67, 264 Harris, K. S., 26, 141, 261 Harshman, R., 35 't Hart, J., 322, 325, 331-3, 339, 360 Harwood, D. A., 292-3 Haskins Laboratories, 3, 68-9, 122, 288 Hawking, S. W., 188 Hawkins, S., 3, 5, 56, 59, 69 Hayes, B., 4, 151, 230, 280, 327, 398 Hebb, D. O., 10 Henderson, J. B., 260 Henke, W. L., 141 Hewlett, N., 128-9, 138-41, 143-5 Hiki, S., 395 Hirose, K., 171, 348, 372, 380, 395-6 Hirschberg, J., 334, 388-9 Hombert, J.-M., 5, 180 Honda, K., 395 Hooper, J. B., 228 Houlihan, K., 316 House, J., 5 Huggins, A. W. F., 294 van der Hulst, H., 332 Hunnicutt, M. S., 116 Huss, V., 333 Hyman, L. M., 84 Isard, S. D., 5
Iverson, G. K., 317 Jackendoff, R., 393 Jaeger, J. J., 185 Jakobson, R., 118, 199, 227, 279 Jespersen, O., 66 Johnson, C. D., 191 Jones, D., 204 Joseph, B. D., 399-400, 414 Kahn, D., 122 Kajiyama, M., 170, 175 Kakita, Y., 395 Kawai, H., 372 Kawasaki, H., 167-9, 182, 256 Kay, B. A., 72 Kaye, J., 193, 196 Keating, P., 26, 59, 123, 283 Keller, E., 60, 69 Kelly, J., 5, 142, 190, 204, 210, 213, 226 Kelso, J. A. S., 10-11, 13-14, 18, 20, 27-8, 60, 65, 68-70, 123 Kenstowicz, M., 150 Kent, R. D., 136 Kerswill, P., 264, 267-8, 270, 278 Kewley-Port, D., 194 Keyser, S. J., 150 King, M., 191 Kingston, J., 3, 60, 65, 121, 177 Kiparsky, P., 236 Kiritani, S., 118 Klatt, D. H., 68, 116, 180, 190, 194-5 Klein, E., 200 Kluender, K., 65 Kohler, K. J., 4, 142-3, 224, 225 Koreman, J., 230 Koutsoudas, A., 191 Kozhevnikov, V. A., 141 Krakow, R. A., 181, 258 Krull, D., 129 Kubozono, H., 125-6, 331, 345, 368-9, 372-3, 379-80, 382, 391-4 Kucera, H., 239 Kuno, S., 369 de Lacerda, A., 139-40 Ladd, D. R., 94, 321, 325-6, 333-6, 342, 343, 346-9, 355, 385-6, 388, 392 Ladefoged, P., 31, 35-6, 93, 137, 141, 159, 191, 193-5, 282, 296-8, 304, 311-12, 314 Lahiri, A., 2, 229-30, 249, 252, 255-7, 274, 327 Lapointe, S. G., 191 Lashley, K. S., 10
Lass, R., 155, 160, 198, 203, 219 Lea, W. A., 116 Leben, W. R., 166, 185 Lehiste, I., 89, 172 Liberman, A. M., 26, 64, 83-4, 90, 178, 398 Liberman, M. Y., 326, 332, 335, 347, 354-5, 389, 391-3, 418 Licklider, J. C. R., 168 Lieberman, P., 129, 327 Lindau, M., 31, 193 Lindblom, B., 5, 65, 129, 143, 167-8, 286 Lindgren, R., 129 Lindsey, G., 328 Linell, P., 191 Linn, Q., 100 Local, J., 4, 142, 190, 196, 204, 210, 213, 216, 224-8 Lodge, K. R., 205 Löfqvist, A., 171 Lorentz, J., 170, 176 Lowenstamm, J., 196 Lubker, J., 168 Lyons, J., 142 McCarthy, J. J., 150, 155, 159, 178, 187, 192, 230 Macchi, M., 87, 123 McCrosky, R. L., 169 McGowan, R. S., 129 MacKay, D. G., 294 McNeil, D., 181 MacNeilage, P. F., 14, 122 Maddieson, I., 176, 282 Maeda, S., 326, 335, 388 Magen, H., 27, 56 Malikouti-Drachman, A., 399-402, 406, 409, 414, 416, 418 Mandelbrot, B., 173 Marslen-Wilson, W. D., 2, 229, 231, 233, 237-8, 255-7 Mascaró, J., 5 Massaro, D. W., 294 Mattingly, I. G., 26, 280 Max-Planck Speech Laboratory, 239 Mehler, J., 293-4 Mehrota, R. C., 299 Menn, L., 326 Menzerath, P., 139-40 Mermelstein, P., 27 Miller, G. A., 168 Miller, J. E., 31 Mills, C. B., 293 Mohanan, K. P., 159, 200, 203, 205 Monsen, R. B., 395
Munhall, K., 9, 13-15, 17, 19, 23, 30, 68-70, 116, 121 Nakatani, L., 333 Nathan, G. S., 205 Neary, T. M., 35 Nelson, W., 60 Nespor, M., 94, 125, 334, 390, 399-403, 405, 414, 418 Nittrouer, S., 71, 129, 136 Nolan, F., 2, 4, 261, 267, 280-8 Noll, C., 191 Norris, D. G., 293 Ohala, J. J., 65, 137, 167-8, 170, 172, 176-9, 181-9, 225, 247, 255-6, 286-7 Ohala, M., 296-8, 310-12 Ohde, R. N., 129, 137 Öhman, S., 26, 67, 173, 178, 201 O'Shaughnessy, D., 328 Ostry, D. J., 60, 68-70, 121 Otsu, Y., 372 Paccia-Cooper, J-M., 213, 333-4 Parush, A., 60, 69, 121 Perkell, S., 35, 67, 201 Peters, P. S., 191 Peterson, G. E., 169 Pickering, B., 5 Pierrehumbert, J., 2-4, 64, 84, 90, 92-4, 97, 117-27, 193, 283, 324-7, 331-2, 334-5, 342-3, 345, 347-8, 354-5, 368-9, 372-3, 375, 385-93, 396 Pike, K., 325 Plato, 180 Pollard, C., 217 Port, R., 279 Poser, W., 368, 372-3, 385, 390, 393 Prather, P., 293 Prince, A., 230, 316, 332, 389, 398, 418 Pulleyblank, D., 153 Pullum, G. K., 191 Puppel, S., 398 Recasens, D., 5, 56 Repp, B. R., 129 Rialland, A., 185 Rietveld, T., 125-6, 326, 331, 334-5, 347, 359-61, 363-4, 366-7, 379, 384-5, 388-91, 394 Riordan, C. J., 171, 301 Ritchie, R. W., 191 Roach, P., 204, 219 Roca, I., 398
Rossi, M., 4 Roudet, L., 66 Rubin, P., 27
Sag, I., 217 Sagart, L., 395 Sagey, E., 153, 160, 187, 198, 282 Saltzman, E., 9-10, 13-15, 17, 19-20, 23, 27-8, 30, 45, 60, 65, 69, 116, 122-3, 288 Samuel, A. G., 294 Sanders, A. G., 191 Sato, H., 372, 380 De Saussure, F., 227 Savin, H. B., 293 Sawashima, M., 118, 171, 395 Schaffer, J., 333 Schein, B., 162 Scherzer, J., 294 Schiefer, L., 296-9, 301, 304-5, 311-18 Schindler, F., 143 Scott, D., 213 Segui, J., 238, 203-4 Selkirk, E., 5, 83, 94, 125, 313, 370, 372-3, 375, 379, 390, 393 Sereno, J. A., 129 Setatos, M., 399-400, 403, 405, 414 Sharf, D. J., 129, 137 Shattuck-Hufnagel, S., 210 Shieber, S. M., 192, 197, 217 Shockey, L., 128, 138-41, 143-5 Shuckers, G., 136 Sievers, E., 66 Silverman, K., 84, 92 Simada, Z., 395-6 Smith, H. L., 398, 418 Smith, N., 332 Sorensen, J., 335 Spriggs, R. K., 199 Steele, S., 86, 178, 347 Steriade, D., 153, 162 Stetson, R., 65 Stevens, K. N., 136, 173, 175, 206-7 Strange, W., 173 Studdert-Kennedy, M., 129 Sudo, H., 368, 372 Sugito, M., 395 Summers, W. V., 75, 143 Sussman, H., 14, 122 Sweet, H., 204 Swinney, D. A., 293 Taft, M., 257 Talkin, D., 2-3, 90, 92, 99, 117-27 Tateishi, K., 5, 370, 372-3, 379, 393 Terken, J., 5
Thorsen, N., see Grønnum Thráinsson, H., 185 Timmermann, G., 142 Touati, P., 331 Trager, G. L., 398, 418 Trubetzkoy, N. S., 176, 193 Tuller, B., 13-14, 18, 65, 123 Turnbaugh, K. R., 129 Tyler, L. K., 236 Uldall, E. T., 172 Uyeno, T., 380 Vaissière, J., 119 Vatikiotis-Bateson, E., 86 Verbrugge, D. P., 173 Vergnaud, J-R., 151-2, 196 Vermula, N. R., 395 Vogel, I., 2, 94, 124-5, 334, 379, 390, 399-403, 405, 414, 418
Vogten, L., 352 Voltaire, 176 Waibel, A., 116 Wall, R. E., 194 Wang, W. S-Y., 176 Warburton, I., 399, 414 Warren, P., 233, 237-8, 257 Watson, I., 5 Wessels, J., 236 Westbury, J. R., 58 Whitney, W., 162 Williams, B., 333 Wood, S., 31 Wright, J. T., 179, 270 Yaeger, M., 178 Zimmer, K. E., 178 Zue, V. W., 210
Subject index
abstract mass, see task dynamics accent, ch. 3 passim, 109, 359-67, ch. 14 passim kinematics of, 73, 105, 109 accentual lengthening, 84, 86 accent level rule, 372 nuclear, 109, 388, 392 word accent, 327; in Japanese, 382 phrasal accent, 120, 327 action theory, 14, 19 aerodynamics, 172 aerodynamic variables, 23 airflow, 137, 179 volume velocity of airflow, 144 affricates, 150 alliteration, 180 allophone separation, see coarticulation allophonic variation, 30, 90, 93, 117, 191, 272, 316 alveolar stops, 143 alveolar weakening, 283, 285 ambisyllabicity, 197 anapest, 180 articulatory setting, 19, 22, 57, 66, 137 articulatory syllable, see coarticulation articulatory synthesis, 29, 268 aspiration, ch. 12 passim; see also stops, voice-onset time assimilation, 4, 149-55, 158-9, 181, ch. 8 passim, ch. 10 passim; see also coarticulation assimilation site, 264 autosegmental representation of, 263 coda and onset assimilation, 197, 213 in German, 226 gradual, 264, 268-9 multiplanar and coplanar formulations of, 153-4
of nasality, 234 partial, 152, 159, 267n. partial assimilation in Kolami, 153 perception of, 268-76 of place, 278, 283-5; in English, 287 regressive assimilation in Hausa, 152 single-feature assimilation, 159 site, 264 total, 159 association domain (AD), 126, 337-47, 349-51, 356, 358 assonance, 180 autosegmental phonology, 119, 149, ch. 6 passim, 166-7, 177-9, 184, 191-2, 201, 230, 235, 261, 288 autosegmental spreading, 152 autosegmental tier(s), 94-5, 118, 152, 159, 164 autosegments, 166, 180, 183, 206, 285 left-to-right mapping, 166, 177 Bambara, 185-6 Bantu languages, 186 base of articulation, see articulatory setting baseline (for F0), 326, 329-34, 396 Bengali, ch. 9 passim, 214, 327, 333n. boundary strength, 333-4; see also phrasing boundary tones, 322, 326-7, 339, 363-4, 389 breathy phonation, 93, 300, 303 breathy voiced stops in Hindi, ch. 12 passim casual speech, 30, 60 catathesis, see downstep cinefluorography, 35 coarticulation, 14, 23, 24, 56-8, 63-5, ch. 5
passim, 198, 202-3, 213-15, 218, 293; see also assimilation, coproduction accounts of; articulatory syllable, 141; feature spreading, 129, 141, 142, 154, 158, 162 anticipatory effects, 56 CV coarticulation, 129-30, 133, 138-9, 213-14 domain of, 141, 203 in connected speech, 135 measures of; allophone separation, 129, 133, 136, 140; target-locus relationship, 129, 135, 140 VV coarticulation, 178 cocatenation, 201-2, ch. 8 passim coda, see syllable cohort model of spoken word recognition, 229-33, 238n. compensation, see immediate compensation compositionality of phonological features, 199 compound accent rules, 372 connected speech processes, 135, 142, 264 consonant harmony, 162-3 contour segment, 150; see also segment coordination, temporal, 13, 56-8, 169, 171-2, 174, 176, 177, 181, 184; see also task dynamics coproduction, 24, 26, 55, 57, 63, 67, 201, 278; see also coarticulation CV-skeleton (CV-core, CV-tier), 150-2, 158, 166-7, 177, 183, 230; see also autosegmental phonology
damping, 15-18, 43, 89; see also task dynamics critical, 15, 18-20, 64, 87 Danish, 345, 361, 366 declarative phonology, 198, 202, ch. 8 passim declination, 117, 329-31, 346; see also downstep deletion, 202, 224 demisyllable, 294 dependency phonology, 191 devoicing, 142 diphone, 294 displacement, see task dynamics distinctive features, see features downstep, 95, 331, ch. 14 passim, ch. 15 passim; see also declination accentual, 335, 343-7, 351-2, 358, 366-7 in Japanese, ch. 15 passim phrasal, 335, 345-7, 350, 356-8, 366-7 duration as cue to stress, 333
of syllables, 76 speech timing, 68 see also length, geminates Dutch, 325-6, 333n., ch. 14 passim, 389-90 dysarthric speech, 58 effector system, see task dynamics electroglottography (EGG), 93, 97, 99, 118 electromyography (EMG), 114, 267 electropalatography (EPG), 206, 207, 209, 213n., 263-73 emphasis, 348 English, 10, 35, 204, 207, 237-9, 258, 278, 280-1, 283, 325, 328, 333n., 336, 338n., 387, 389-91, 398, 408 English vowels, 35 Middle English, 169 nasalized vowels in, 238-9, 246-8, 255-6 Old English, 169 epenthesis, 151, 153, 191, 224, 279 in Palestinian, 151 Ewe, 370n. exponency, see phonetic exponency
F0, see intonation, pitch accent, pitch range features, 117-8, ch. 6 passim, 178-80, 196, 199, 203, 231, 262, 281-3, 294, 297-8, 311-18, 394 articulator-bound, 159 category-valued, 156 emergence of, 93 interdependencies among, 178 intrinsic content of, 177 n-ary valued, 156, 316, 318 natural classes of, 155, 161, 187, 296, 313 phonetic correlates of, 93, 174-6, 184, 194-7 suprasegmental, 189 feature asynchrony, 186, 187, 189 feature dependence, 155 feature geometry, ch. 6 passim, 178-9, 187 feature specification theory, 230 feature spreading, see coarticulation feature structures, 197, 200, 214-19, 224 fiberscope, 118 final lengthening, 77-9, 83, 84, 89, 110, 116; see also phrasing kinematics of, 77 final lowering, 116, 326n., 354, 364, 389; see also phrasing Firthian phonology, 184, 190, 192, 193-4, 198, 218-9 focus, pragmatic, 390-2 foot, 142, 191
French, 27, 294 fricatives, 129 spectra of, 129 voiceless, 171 fundamental frequency (F0), see intonation, pitch accent, pitch range gating task, 236, 237-9, 240, 242, 244-5, 250-1, 253, 257-8 hysteresis effects in, 237n., 256 geminate consonants, 150-1, 153, ch. 9 passim geminate integrity, 151 generalized phrase-structure grammar (GPSG), 156 generative phonology, 155; see also SPE German, 226, 326, 336 GEST, 29f., 288 gesture, ch. 1 passim, 27, 71f., 288, 298; see also task dynamics opening, 73-9 closing, 73-9 relation between observed and predicted duration of, 79-81 gestural amplitude, 68-70 gestural blending, 19, 30 gestural duration, 88-9, 120, 124 gestural magnitude, 105, 109, 111, 114, 116, 120, 124, 169 gestural overlap, 14, 22, 28, 289; see also coarticulation gestural phasing, 17, 18, 21, 44, 70, 72, 76 gestural score, 14, 17, 22, 27-9, 30f., 44-6, 52-5, 57, 60, 64, 165 gestural stiffness, 11, 16, 17, 19, 44, 68-70, 76; lower limit on, 80-2 gestural structure, 28 gestural truncation, 70-2 glottal configurations in [h] and [ʔ], 93-4 glottal cycle, 93-4, 98, 100 graph theory, 157, 196-7, 210 Greek, 177, ch. 16 passim Gujarati, 309n. Hausa, 387-8 Hindi: stops in, 296, ch. 12 passim; see also stops iamb, 180 icebergs, 89, 174n. Icelandic, 185, 187 Igbo, 185-6 immediate compensation, 10, 13 initial lengthening, 116 insertion, see epenthesis
integer coindexing, 218 intonation, 92, 116, ch. 13 passim, 327, 394 effects on source characteristics, 92 of Dutch, ch. 14 passim notation of, 326, ch. 14 passim intonational boundary, 109; see also boundary tones intonational locus, ch. 14 passim inverse filtering, 93, 95 Italian, 287, 370n. Japanese, 324, 331, 333n., 345, ch. 15 passim jaw, see mandible Kabardian, 283n. Kikuyu, 185 Kolami, 153 Latin, 287 length (feature), ch. 9 passim lexical-functional grammar (LFG), 156 lexical items, mental representation of, 229-31, ch. 9 passim, 255, 291 lexical stress, see stress long vowels, 150 Ludikya, 186; see also word games Luganda, 185 Maithili, 298 major phrase, 373, 379, 383-5, 392; see also phrasing formation of, 372 Mandarin, 395 mandible, 10-13, 87-8, 114, 122-3, 140, 144 durations of movements, 88-9 Marathi, 316 mass-spring system, see task dynamics metrical boost (MB), 379-86 metrical phonology, 180, 257, 331, ch. 16 passim minor phrase, 377, 392; see also phrasing formation of, 372 mora, 86, 324 stability of, 186 morphology and word recognition, 257 morphophonologization, 274 motor equivalence, see task dynamics movement trajectories, see task dynamics movement velocity, see task dynamics n-Retroflexion, 150, 162, 163 nasal prosodies, 178 nasalization, 178n., 179, 182, 243, 247n.; see also assimilation, nasality
nasality, 234 in English and Bengali, ch. 9 passim Nati, see n-Retroflexion Nepali, 309n. neutralization, 213, 231, 274 neutral attractor, see task dynamics neutral vowel, 65; see also schwa nuclear tone, 325 nucleus, see syllable
object, see task dynamics obligatory contour principle, 166; see also autosegmental phonology onset, see syllable optoelectronic tracking, 72 overlap, 28, 30, 56, 57, 141, 289; see also coarticulation, task dynamics perturbation experiments, 11 phase, 72, 87, 313-14 phase space, see task dynamics phoneme monitoring task, 292 phonetic exponency, 194, 202-3, 218-19, 225 phonetic representations, 118, 193-4, 224-5, 278, 315, 388 phonological domains, consistency of, 126 phonological representations, 118, 149, 200, 225, 285, 315-18, 388, 394; see also autosegmental phonology, Firthian phonology, metrical phonology abstractness of, 93, 229-30 monostratal, 193 phonology-phonetics interface, 3, 24, 58, 124-7, 193, 195, 205, 224-8, 394 phrase-final lengthening, see final lengthening phrase-initial lengthening, see initial lengthening phrasing, 73, 337, 366 effects on /h/, 109-14 effects on gestural magnitude, 111-4 phrase boundary effects, 110-14, 336-8, 363; see also phrasal vs. word prosody Pig Latin, 292; see also word games pitch accent, 322-6, 332, ch. 14 passim, 361-7, 387-90; see also accent bitonal, 326 pitch change, 360 pitch range, 330, 348, 352, 356, 358, 387, 392-7; see also register shift, declination point attractor, see task dynamics prenasalized stops, 150 prevoiced stops, ch. 12 passim primary stress, 407-9; see also stress, secondary stress prominence, 348, 387, 392-3
prosody, 91, 180, ch. 13 passim, 399-400 prosodic effects across gesture type, 120-2 prosodic structure, ch. 3 passim, 331-2, 338, 369, 394; effect on segmental production, 112 phrasal vs. word prosody, 91, 92, 94-7, 114-6, 121 Proto-Indo-European, 286 psycholinguistics, 2, 229, 254, 293-5 Punjabi, 297
realization rules, 124, 215-16, 386 redundancy, 173, ch. 9 passim register shift, 372, 380, 384-6; see also downstep, upstep relaxation target, see task dynamics reset, 335, 343-7, 349, 366, 380, 384f.; see also phrasing, pitch range, register shift as local boost, 343 as register shift, 343-7 retroflex sounds, 162; see also n-Retroflexion rewrite rules, 226 rhyme (rime), see syllable rhythm, 398-402; see also stress rule interaction, 191 rule ordering, 191 Russian, 278 Sanskrit, 317 schwa, 26, ch. 2 passim epenthetic, 53 in English, ch. 2 passim in French, 27 intrusive, 59 simulations of, 51, 52, 65 targetless, ch. 2 passim, 66 unspecified, 52, 53 Scots, 160 secondary articulation, 179 secondary stress, ch. 16 passim segment, 4, 128, 141, ch. 6 passim, ch. 7 passim, 198, 201, ch. 10 passim, ch. 11 passim boundaries, 181 contour segments, 150, 160, 282, 285 complex segments, 150, 160, 283, 285 segmental categories, 207 hierarchical view of segment structure, 161 relation to prosodic structure, 94, 117-8 spatial and temporal targets in, 3 steady-state segments, 172-6 segment transitions, 172-4, 176, 181 Semitic verb morphology, 186 shared feature convention, 166 skeletal slots, 154; see also CV-skeleton
skeleton, see CV-skeleton slips of the tongue, see speech errors sonority, 68, 69, 83-5, 123-4 sonority space, 84, 86 sound change, 182, 286-7 The Sound Pattern of English (SPE), 116, 149, 155, 191-2, 195, 201, 296-7 speech errors, 180, 292, 294 speech rate, 23, 65, 136, 264, 268-9 speech recognition, 294 speech style, 23, 129, 137, 144-5, 267, 362; see also casual speech speech synthesis, 29, 101, 102, 190, 192, 225, 268 speech technology, 116 stiffness, see task dynamics stops, ch. 5 passim, 171, 175-6, ch. 12 passim bursts, 131-3, 137, 140, 144, 172, 176 preaspirated stops in Icelandic, 185, 187 stress, 4, 120-3, 151, 191, 392, ch. 16 passim; see also accent accent, 332 contrastive, 91 effects of phrasal stress on /h/, 103-9 in English, 332 in Greek, 4, ch. 16 passim lexical stress, 399-402, 405-6, 417-19 rhythmic stress, 416, 418-19; see also rhythm word stress, 120-1 sub-lexical representations, 291, 295 Swedish, 325, 328, 400 syllabification, 294 syllable, 73-5, 84, 90, 109, 120-3, 142-3, 181, 191, 194, 293-4, 304, 324, 332-3, 360-3, 399-401, 408-13, 417 accented, 73-7, 84-6, 94, 324, 337-8, 341, 359-60 coda, 199, 213, 216-20, 228 deaccented, 94 durations of, 76, 77, 79f., 80-2 nucleus, 204, 213, 214 onset, 119, 191, 200, 213, 216-18 prosodic organization of, 83 rhyme (rime), 180, 200, 214 unaccented, 77 syllable target, 294 syntax, effects on downstep, 369, 379 syntax-phonology interaction, 369, ch. 15 passim; see also phrasing of English monosyllables, 214 validation using synthetic signals, 101, 102 Tamil, 333n. target parameter, 29, 139
targets, phonetic, ch. 2 passim, ch. 3 passim, 322, 325 targets, tonal, ch. 14 passim scaling of, 347-51 task dynamics, 3, ch. 1 passim, ch. 2 passim, 68-72; see also gesture, overlap abstract mass, 20 articulator weightings, 17 articulator motions, 46 articulator space, 12 body space, 12 coordination in, 23, 65 displacement, 16, 70, 72-3, 76, 80, 122 effector system, 12, 13 evidence from acquisition of skilled movements, 22, 24, 143 mass-spring system, 15, 17, 19, 87-9 motor equivalence, 10 movement trajectories, 10, 16 movement velocity, 70; lower limit on, 82 neutral attractor, 19, 22 object, 11 oscillation in task dynamics, 87 phase space, 18 point attractor, 15 relaxation target, 66 stiffness, 16, 17, 87 task dynamic model, 9, 27, 57, 69 task space, 12 tract variables, ch. 1 passim, 23, 27, 28f., 122-3, 124; passive, 29 tempo, 68, 79-84; see also speech rate, final lengthening temporal modulation, 89 temporal overlap, see overlap timing slots, 166, 177; see also CV-skeleton tip of the tongue (TOT) recall, 180-1 Tokyo X-ray archive, 31 tonal space, 329-31, 334 tone(s), 84, 116, 150-1, 185, 282, 326-9, ch. 14 passim, ch. 15 passim; see also downstep, upstep as a grammatical function, 186 boundary tones, 322, 326-7, 339, 363-4, 389 in Igbo, 185, 186 in Kikuyu, 185 and prosodic environment, 90, 91, 94 tone scaling, 84, 347-51, 395-7 tone spreading, 178, 336, 341 starred and unstarred, 326, 336 trace model of speech perception, 233n. tract variables, see task dynamics transformational generative phonology (TGP), see SPE
transillumination, 121 trills, 179 trochee, 180 tunes, 324; see also CV-skeleton underspecification, 26, 28, 54, 119, 200, 216-19, 198, 228, 255, 257 unification grammar (UG), 156, 192 upstep, 348, 379, 382, 385 variation interspeaker, 68, 119 intraspeaker, 290 situation-specific, 290 contextual variation of phonetic units, 55
voice-onset time (VOT), 121-2, 131, 279, 296-8, 311-12 voicing, strategies for maintaining, 58 vowel centralization, 129-30, 133, 135, 137; see also schwa, neutral vowel vowel harmony, 142, 152, 154, 178, 187 vowel space, 129, 169 Welsh, 333 within-speaker variability, 290 word games, 180, 181, 294 X-ray micro-beam, 32-40, 46-50, 62-3, 267 Yoruba, 326