PAPERS IN LABORATORY PHONOLOGY SERIES EDITORS: MARY E. BECKMAN AND JOHN KINGSTON
Papers in Laboratory Phonology I Between the Grammar and Physics of Speech
Papers in Laboratory Phonology I Between the Grammar and Physics of Speech EDITED BY JOHN KINGSTON Department of Modern Languages and Linguistics, Cornell University
AND MARY E. BECKMAN Department of Linguistics, Ohio State University
The right of the University of Cambridge to print and sell all manner of books was granted by Henry VIII in 1534. The University has printed and published continuously since 1584.
CAMBRIDGE UNIVERSITY PRESS
Cambridge New York Port Chester Melbourne Sydney
Published by the Press Syndicate of the University of Cambridge The Pitt Building, Trumpington Street, Cambridge CB2 1RP 40 West 20th Street, New York, NY 10011, USA 10 Stamford Road, Oakleigh, Melbourne 3166, Australia © Cambridge University Press 1990 First published 1990 British Library cataloguing in publication data
Between the grammar and physics of speech. -(Papers in laboratory phonology; 1) 1. Phonology I. Kingston, John II. Beckman, Mary E. III. Series 414 Library of Congress cataloguing in publication data
Between the grammar and physics of speech/edited by John Kingston and Mary E. Beckman. p. cm. - (Papers in laboratory phonology; 1) Largely papers of the First Conference in Laboratory Phonology, held in 1987 in Columbus, Ohio. Includes indexes. ISBN 0-521-36238-5 1. Grammar, Comparative and general — Phonology — Congresses. 2. Phonetics — Congresses. I. Kingston, John (John C.) II. Beckman, Mary E. III. Conference in Laboratory Phonology (1st: 1987: Columbus, Ohio) IV. Series. P217.847 1990 414-dc20 89-36131 CIP ISBN 0 521 36238 5 hard covers ISBN 0 521 36808 1 paperback
Transferred to digital printing 2004
Contents
List of contributors   page viii
Acknowledgements   x
1 Introduction   MARY E. BECKMAN AND JOHN KINGSTON   1
2 Where phonology and phonetics intersect: the case of Hausa intonation   SHARON INKELAS AND WILLIAM R. LEBEN   17
3 Metrical representation of pitch register   D. ROBERT LADD   35
4 The status of register in intonation theory: comments on the papers by Ladd and by Inkelas and Leben   G. N. CLEMENTS   58
5 The timing of prenuclear high accents in English   KIM E. A. SILVERMAN AND JANET B. PIERREHUMBERT   72
6 Alignment and composition of tonal accents: comments on Silverman and Pierrehumbert's paper   GÖSTA BRUCE   107
7 Macro and micro F0 in the synthesis of intonation   KLAUS J. KOHLER   115
8 The separation of prosodies: comments on Kohler's paper   KIM E. A. SILVERMAN   139
9 Lengthenings and shortenings and the nature of prosodic constituency   MARY E. BECKMAN AND JAN EDWARDS   152
10 On the nature of prosodic constituency: comments on Beckman and Edwards's paper   ELISABETH SELKIRK   179
11 Lengthenings and the nature of prosodic constituency: comments on Beckman and Edwards's paper   CAROL A. FOWLER   201
12 From performance to phonology: comments on Beckman and Edwards's paper   ANNE CUTLER   208
13 The Delta programming language: an integrated approach to nonlinear phonology, phonetics, and speech synthesis   SUSAN R. HERTZ   215
14 The phonetics and phonology of aspects of assimilation   JOHN J. OHALA   258
15 On the value of reductionism and formal explicitness in phonological models: comments on Ohala's paper   JANET B. PIERREHUMBERT   276
16 A response to Pierrehumbert's commentary   JOHN J. OHALA   280
17 The role of the sonority cycle in core syllabification   G. N. CLEMENTS   283
18 Demisyllables as sets of features: comments on Clements's paper   OSAMU FUJIMURA   334
19 Tiers in articulatory phonology, with some implications for casual speech   CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN   341
20 Toward a model of articulatory control: comments on Browman and Goldstein's paper   OSAMU FUJIMURA   377
21 Gestures and autosegments: comments on Browman and Goldstein's paper   DONCA STERIADE   382
22 On dividing phonetics and phonology: comments on the papers by Clements and by Browman and Goldstein   PETER LADEFOGED   398
23 Articulatory binding   JOHN KINGSTON   406
24 The generality of articulatory binding: comments on Kingston's paper   JOHN J. OHALA   435
25 On articulatory binding: comments on Kingston's paper   LOUIS GOLDSTEIN   445
26 The window model of coarticulation: articulatory evidence   PATRICIA A. KEATING   451
27 Some factors influencing the precision required for articulatory targets: comments on Keating's paper   KENNETH N. STEVENS   471
28 Some regularities in speech are not consequences of formal rules: comments on Keating's paper   CAROL A. FOWLER   476
Index of names   491
Index of subjects   496
Contributors
MARY E. BECKMAN   Department of Linguistics, Ohio State University
CATHERINE P. BROWMAN   Haskins Laboratories
GÖSTA BRUCE   Department of Linguistics and Phonetics, Lund University
G. N. CLEMENTS   Department of Modern Languages and Linguistics, Cornell University
ANNE CUTLER   MRC Applied Psychology Unit
JAN EDWARDS   Hunter School of Health Sciences, Hunter College of the City University of New York
CAROL A. FOWLER   Department of Psychology, Dartmouth College
OSAMU FUJIMURA   Linguistics and Artificial Intelligence Research, AT&T Bell Laboratories
LOUIS GOLDSTEIN   Departments of Linguistics and Psychology, Yale University, and Haskins Laboratories
SUSAN R. HERTZ   Department of Modern Languages and Linguistics, Cornell University, and Eloquent Technology Inc.
SHARON INKELAS   Department of Linguistics, Stanford University
PATRICIA A. KEATING   Department of Linguistics, University of California, Los Angeles
JOHN KINGSTON   Department of Modern Languages and Linguistics, Cornell University
KLAUS J. KOHLER   Institut für Phonetik und digitale Sprachverarbeitung, Universität Kiel
D. ROBERT LADD   Department of Linguistics, University of Edinburgh
PETER LADEFOGED   Department of Linguistics, University of California, Los Angeles
WILLIAM LEBEN   Department of Linguistics, Stanford University
JOHN J. OHALA   Department of Linguistics, University of California, Berkeley
JANET B. PIERREHUMBERT   Linguistics and Artificial Intelligence Research, AT&T Bell Laboratories
ELISABETH SELKIRK   Department of Linguistics, University of Massachusetts
KIM E. A. SILVERMAN   Linguistics and Artificial Intelligence Research, AT&T Bell Laboratories
DONCA STERIADE   Department of Linguistics and Philosophy, MIT
KENNETH N. STEVENS   Speech Communication Group, Research Laboratory of Electronics, MIT
Acknowledgements
This collection of papers and the First Conference in Laboratory Phonology, where many of them were first presented, could not have come about without the generous support of various organizations. Funds for the conference were provided by several offices at the Ohio State University including the Office for Research and Graduate Studies, the College of Humanities, and the Departments of Linguistics and Psychology. The Ohio State University and the Department of Modern Languages and Linguistics at Cornell University also contributed materially to preparing this volume. Essential support of a different sort was provided by Janet Pierrehumbert, who coined the term "laboratory phonology" for us, and who gave us crucial encouragement at many stages in organizing the conference and in editing this volume of papers.
1 Introduction
MARY E. BECKMAN AND JOHN KINGSTON
While each of the papers in this volume has its specific individual topic, collectively they address a more general issue, that of the relationship between the phonological component and the phonetic component. This issue encompasses at least three large questions. First, how, in the twin processes of producing and perceiving speech, do the discrete symbolic or cognitive units of the phonological representation of an utterance map into the continuous psychoacoustic and motoric functions of its phonetic representation ? Second, how should the task of explaining speech patterns be divided between the models of grammatical function that are encoded in phonological representations and the models of physical or sensory function that are encoded in phonetic representations? And third, what sorts of research methods are most likely to provide good models for the two components and for the mapping between them? Previous answers to these questions have been largely unsatisfactory, we think, because they have been assumed a priori, on the basis of prejudices arising in the social history of modern linguistics. In this history, phonology and phonetics were not at first distinguished. For example, in the entries for the two terms in the Oxford English Dictionary each is listed as a synonym for the other; phonology is defined as "The science of vocal sounds ( = PHONETICS)" and phonetics as "The department of linguistic science which treats of the sounds of speech; phonology." The subsequent division of this nineteenth-century "science of sounds" into the two distinct subdisciplines of phonology and phonetics gave administrative recognition to the importance of the grammatical function of speech as distinct from its physical structure and also to the necessity of studying the physical structure for its own sake. But this recognition was accomplished at the cost of creating two separate and sometimes mutually disaffected scientific subcultures. We can trace the origin of this cultural fissure to two trends. One is the everincreasing reliance of phonetic research on technology, rather than on just the analyst's kinesthetic and auditory sensibilities. This trend began at least in the first decade of this century, with the use of the X-ray to examine vowel production and
the adoption of the kymograph for examining waveforms. With such technical aids, phoneticians could observe the physical aspects of speech unfiltered by its grammatical function. With this capability, phonetics expanded its subject matter far beyond the taxonomic description of "speech sounds" found in phonological contrast, to develop a broader, domain-specific attention to such extra-grammatical matters as the physiology of speech articulation and the physics of speech acoustics, the peripheral and central processes of speech perception, and the machine synthesis and recognition of speech. The other trend that led to the separation of the two subdisciplines was the development of more complete formal models of the grammatical function of speech than are instantiated in the International Phonetic Alphabet. This trend had its initial main effect in the 1930s, with the emergence of distinctive feature theory, as elaborated explicitly in Prague Circle phonology (Trubetzkoy 1939) and implicitly in the American structuralists' emphasis on symmetry in analyzing phonological systems (Sapir 1925). Distinctive feature theory effectively shifted the focus of twentieth-century phonology away from the physical and psychological nature of speech sounds to their role in systems of phonemic contrast and morphological relatedness. Both of these trends undermined the alphabetic model that underlay the nineteenth-century synonymy between phonetics and phonology, but they did so in radically different ways. The analysis of "vocal sounds" into their component units of phonological contrast eventually led to new non-alphabetic representations in which phonological features were first accorded independent commutability in different rows of a matrix and then given independent segmentation on different autosegmental tiers. The use of new technology, on the other hand, questioned the physical basis originally assumed for alphabetic segmentation and commutability, by revealing the lack of discrete sequential invariant events in articulation or acoustics that might be identified with the discrete symbols of the IPA. These radically different grounds for doing away with a strictly alphabetic notation for either phonological or phonetic representations produced an apparent contradiction. Modelling the cognitive function of speech as linguistic sign requires two things: first, some way of segmenting the speech signal into the primitive grammatical entities that contrast and organize signs and second, some way of capturing the discrete categorical nature of distinctive differences among these entities. A direct representation of these two aspects of the grammar of speech is so obviously necessary in phonological models that it is hardly surprising that the early, rudimentary phonetic evidence against physical segmentation and discreteness should elicit the reaction that it did, a reaction caricatured in Trubetzkoy's declaration that "Phonetics is to phonology as numismatics is to economics." A more benign form of this prejudice recurs in the common assumption among phonologists that nonautomatic, language-specific aspects of phonetic repre-
sentations and processes should share the discrete segmental nature of phonological symbols and rules. This apparent contradiction induced also a complementary prejudice on the part of phoneticians. Instrumentally-aided investigation of speech has resulted in decades of cumulative progress in phonetic modeling, including the monumental achievement of the acoustic theory of speech production (Fant 1960). A great deal of this research has necessarily been concerned with the details of mapping from one extra-grammatical system to another — for example, from acoustic pattern to cochlear nerve response or from motor excitation to articulatory pattern. This research into the relationships among different phonetic subcomponents has derived little direct benefit from advances in phonological theory. As a result, it has often been assumed that arguments about phonological representations and processes are irrelevant to the phonetic component as a whole, a prejudice that could be expressed in its most malignant form as "phonology is to phonetics as astrology is to astronomy." We have caricatured these prejudices at some length because we feel that they are a major impediment to answering our three questions concerning the relationship between phonology and phonetics. They distort our pictures of the two linguistic components and of the shape of the mapping between them. One set of theories describes the mapping as a trivial translation at the point where the linguistically relevant manipulations of discrete symbolic categories are passed to the rote mechanics of production and perception. Another set of theories places the dividing line at the point where the arbitrary taxonomy of linguistic units yields to experimentally verifiable models of speech motor control, aerodynamics, acoustics and perception. Such distortions are inevitable as long as the relegation of aspects of sound patterns between the two linguistic components is guided by unquestioned assumptions about what research methods are appropriate to which field. Therefore, we ask: how can we use the physical models and experimental paradigms of phonetics to construct more viable surface phonological representations? Conversely, what can we learn about underlying phonetic representations and processes from the formal cognitive models and computational paradigms of phonology? Determining the relationship between the phonological component and the phonetic component demands a hybrid methodology. It requires experimental paradigms that control for details of phonological structure, and it requires observational techniques that go beyond standard field methods. The techniques and attitudes of this hybrid laboratory phonology are essential to investigating the large group of phonic phenomena which cannot be identified a priori as the exclusive province of either component. An example of such a phenomenon is fundamental frequency downtrend. It is a common observation that F0 tends to fall over the course of an utterance. Phonologists have generally assumed that this downtrend belongs to the
phonological component. They have postulated simple tone changes that add intermediate tone levels (e.g. McCawley's 1968 rule lowering High tones in Japanese to Mid tone after the first unbroken string of Highs in a phrase), or they have proposed hierarchical representations that group unbroken strings of High tones together with following Lows in tree structures that are interpreted as triggering a downshift in tonal register at each branch (e.g. Clements 1981). Phoneticians, on the other hand, have typically considered downtrend to belong exclusively to the phonetic component. They have characterized it as a continuous backdrop decline that unfolds over time, independent of the phonological tone categories. They have motivated the backdrop decline either as a physiological artifact of decaying subglottal pressure during a "breath group" (e.g. Lieberman 1967), or as a phonetic strategy for defining syntactic constituents within the temporal constraints of articulatory planning (e.g. Cooper and Sorenson 1981). Each of these models is circumscribed by our notions about what research methods are appropriate to which linguistic subcomponent. If the observed downtrend in a language is to be in the province of phonological investigation, it must be audible as a categorical tone change or register difference, and its immediate cause must be something that can be discovered just by examining the paradigm of possible phonological environments. If the downtrend is to be in the province of phonetic investigation, on the other hand, it must be quantifiable as a response to some physically specifiable variable, either by correlating fundamental frequency point-by-point to subglottal air pressure or by relating fundamental frequency averages for syllables to their positions within phonologically unanalyzed utterances of varying length. Each sort of model accounts for only those features of downtrend which can be observed by the methods used. Suppose, however, that the downtrend observed in a given language is not a single homogeneous effect, or suppose that it crucially refers both to discrete phonological categories and to continuous phonetic functions. Then there will be essential features of the downtrend that cannot be accounted for in either model. Indeed these features could not even be observed, because the research strategy attributes downtrend a priori either to manipulations of phonological representations or to phonologically blind phonetic processes. In recent examples of the hybrid methods of laboratory phonology, Pierrehumbert has argued with respect to English and Poser and others (Pierrehumbert and Beckman 1988) regarding Japanese that downtrend is just such a heterogeneous complex of different components, many of which are generated in the mapping between phonological and phonetic representations. In both English and Japanese, certain phrase-final tones trigger a gradual lowering and compression of the pitch range as a function of the distance in time from the phrase edge. This component of downtrend is like the phonologically-blind declination assumed in earlier phonetic models in that it seems to be a gradual backdrop decline. Yet it is unlike them in that it refers crucially to phonological
phrasing and phrase-final tone features. Also, in both languages, certain other, phrase-internal, tonal configurations trigger a compression of the overall pitch range, which drastically lowers all following fundamental frequency values within some intermediate level of phonological phrasing. This largest component of downtrend is like the intermediate tone levels or register shifts in earlier phonological models in that it is a step-like change triggered by a particular phonological event, the bitonal pitch accent. Yet it is unlike them in that it is implemented only in the phonetic representation, without changing the phonological specification of the affected tones. If these characterizations are accurate, then downtrend cannot be modeled just by reference to the phonological or the phonetic structure. Indeed neither of these two components of downtrend can even be observed without instrumental measurements of fundamental frequency values in experiments that control for phonological tone values and phrasal structures. The phenomenon of downtrend seems to require such hybrid methods. We think, moreover, that the list of phenomena requiring such hybrid methods and models is much larger than hitherto supposed. We believe that the time has come to undo the assumed division of labor between phonologists and other speech scientists; we believe this division of labor creates a harmful illusion that we can compartmentalize phonological facts from phonetic facts. At the very least, we maintain that the endeavor of modeling the grammar and the physics of speech can only benefit from explicit argument on this point. In support of this thesis, we present to you the papers in this volume. Most of these papers were first presented at a conference we held in early June of 1987 at the Ohio State University. To this conference we invited about 30 phonologists and phoneticians. The papers at the conference were of two sorts. We asked some of the participants to report on their own research or ideas about some phenomenon in this area between phonology and phonetics. We asked the other participants to present papers reacting to these reports, by showing how the research either did or did not consider relevant phonological structures or phonetic patterns, and by reminding us of other research that either supported or contradicted the results and models proposed. By structuring the conference in this way we hoped to accomplish two things. First, we wanted to show the value of doing research in this area between phonology and phonetics, and second, we wanted to provoke phonologists and phoneticians into talking to each other and into thinking about how the methods and aims of the two fields could be united in a hybrid laboratory discipline tuned specifically to doing this sort of research. After the conference, we commissioned both sets of participants to develop their presentations into the papers which we have grouped in this volume so that the commentary papers follow immediately upon the paper to which they are reacting. The specific topics that these groups of papers address fall into several large categories. First are papers which focus on suprasegmental phenomena in
language. One of these is the representation of tone and intonation. In this group, Inkelas and Leben examine tone patterns in Hausa, a language which is unlike English or Japanese in having a dense lexical assignment of tone. They ask what intonational patterns can exist in such a language and how such patterns should be represented. They show representative F o contours and mean F o values in a lexically and syntactically various set of Hausa sentences to argue for several intonational modifications of lexical tone patterns, including downdrift (a downstepping at each phrase-internal LH sequence) and a boosting of tones in the final phrase of a question. They propose that all of these modifications, including downdrift, are generated directly in the phonology by inserting tones on an autosegmental register-tone tier. These register tones are indirectly associated to tones on a primary tone tier by being linked to the same tonal nodes for minimal tone-bearing units, and they are interpreted in the phonetics as contextual modifications to the values for the corresponding primary tones. The boosting of tones in questions is represented as the effect of H register tones that link to tonal nodes in the final phrase, and downdrift is represented as the effect of a L that is inserted at each register slot corresponding to a primary L and then spread to following nodes for primary H tones. Inkelas and Leben argue that only this sort of directly phonological representation can explain the distribution of intonational effects in Hausa. For example, downdrift seems to be blocked in the last phrase of a question, just as would be expected if the posited register H here blocked the insertion of the downdrift-causing register L. Pierrehumbert and Beckman (1988), on the other hand, have suggested that some of these seemingly categorical dependencies might be artifacts of interactions among the effects of phonetic rules. For example, if the boosting of tones in the last phrase of a question were a gradual increase like the final raising attested in Tokyo Japanese, it could obliterate the effects of a phonetic downdrift rule without actually blocking its application. The experiments necessary to distinguish between these two accounts promise to enrich the phonological theory by the constraints on surface phonological representations that they might indicate. Ladd's paper also proposes a more directly phonological account of downtrend, in this case of downstepping intonations in English. In his model of shifting phrasal pitch registers, downstep is not a phonetic rule triggered by bitonal pitch accents, as proposed by Pierrehumbert (see Pierrehumbert 1980; Liberman and Pierrehumbert 1984; Beckman and Pierrehumbert 1986), but rather is the tonal interpretation of strength relationships among nodes in a prosodic constituent tree. In other words, the model formulates downstep as a global effect of tonal prominence, the abstract phonetic property that highlights one pitch-accented syllable over another. Ladd further argues that tonal prominence is generated and represented in the grammar as a pattern of relationships among nodes in a constituent tree. Thus, he rejects Pierrehumbert's (1980) argument against direct arboreal representations of downtrend; he proposes that downstep and other
prominence relationships are instead generated in the grammar as the tonal counterpart of the utterance's stress pattern, and that a direct phonological representation of downstep is therefore not redundant to the phonetic mapping rules. By thus claiming that downstep is a phonological prominence relationship, Ladd's model predicts that there is no distinction between an accent that is merely downstepped relative to the preceding accent and one that is both downstepped and subordinated in focus. As Pierrehumbert and Beckman (1988) point out, this distinction is essential to the correct prediction of Japanese intonation patterns. It will be interesting to see whether the relevant production experiments show such a fundamental difference between English and Japanese. Thus, Ladd's proposal provides experimentally falsifiable predictions, a defining feature of research in laboratory phonology. It also motivates a new series of experiments on the perception and production of prominence relationships in English, and a new point of comparison between stress-accent and non-stress-accent patterns. Clements argues that there are devices other than a supraordinate phonological representation of register which may account as well for effects that Inkelas and Leben or Ladd attribute to their register tiers or trees. Clements also reminds us of the importance of striving for completeness and explicitness in models of phonetic implementation for tone structures. Another pressing question in the area of intonation is how to predict the alignment between a tone and its associated tone-bearing unit. This is especially problematic in languages such as English, where the minimal tone-bearing units (syllables) have highly variable lengths due to the many different contextual prosodic features that affect segmental rhythms, including syllable weight, stress-foot organization, and phrasal position. Silverman and Pierrehumbert investigate this question for English prenuclear single-tone (H*) pitch accents. They review the possible phonological and phonetic mechanisms that might govern the alignment of the accent's tone to the syllable, and derive a set of predictions concerning the precise timing of a related F0 event that can differentiate among the various mechanisms. They then present experimental evidence suggesting that the relationship cannot be accounted for as an artifact of aligning a tonal gesture that is invariant at a given rate to the beginning of a syllable whose duration varies with prosodic context. Nor can it be modeled as the consequence of a directly rhythmic phonological manipulation that adds discrete increments to the syllable's duration after the tone's alignment point. Instead, the timing of the F0 event seems to be a combination of two phonetic rules: a rule of timing on the tonal tier that shifts the prenuclear accent tone in a stress clash leftward so as to ensure a minimum distance between it and the following nuclear accent tones, and a rule of coordination between tones and segments that computes when an accent's dominant tone will occur by referring to the time course of the sonority contour of the associated accented syllable. Bruce's commentary compares Silverman and Pierrehumbert's findings to
similar data in Swedish which support their conclusion that coordination between tones and their associated syllables is described more accurately in the phonetic component than by building a direct phonological representation of the relevant rhythmic variation. He also points out that data such as these are relevant to the issue of tonal composition. The notation implies that the F o event corresponding to the starred tone of the pitch accent occurs within the accented syllable; yet the F o peak corresponding to the H* accent in Silverman and Pierrehumbert's experiment often occurred after the end of the stressed syllable. If the relevant F o event does not occur within the stressed syllable, should we revise our tonal analysis of the F o event ? Precisely how can we apply such phonetic data to the determination of phonological form ? Bruce concludes by suggesting that in stresstimed languages such as Swedish and English, coordination between tonal and segmental features is critical only at the boundaries of certain prosodic groupings larger than the syllable. Kohler's paper addresses a closely related issue in the relationship between stress and accent, namely the differentiation of accent placement and accent composition in their contribution to the F o pattern. This is especially problematic in languages such as English and German, where the tonal composition of the pitch accent is not invariant (as in Danish), or lexically determined (as in Swedish), but rather is selected from an inventory of intonational shapes. In both languages, the combination of lexically contrastive accent placement with the different possible tonal shapes for the accent results in the possibility of different intonation patterns having superficially identical F o patterns. For example, when an intonation pattern with a peak aligned toward the end of the accented syllable is produced on a word that has primary stress on its first syllable, it may yield an F o contour that is virtually indistinguishable from that which occurs when another intonation pattern with a fall from a peak aligned toward the beginning of the accented syllable is produced on a word that has primary stress on its second syllable. Kohler asks how listeners can interpret such seemingly ambiguous structures; what cues do they use? He addresses this question with a series of perception tests involving hybrid resynthesis whereby the F o pattern from one intonation is combined with the segment durations and spectra of the opposing accent pattern. Kohler then turns his attention to the interaction of the more global determinants of F o contours with the more microscopic influences on F o of segmental features; specifically the fortis/lenis contrast in obstruents. He proposes that these microprosodic effects are not discernible in all intonational contexts and that listeners therefore cannot always employ them as cues to this contrast. As Silverman observes in his commentary on this paper, the results provide strong support for a hierarchical organization in which less prominent syllables at the lowest levels are marked absolutely as unstressed by their rhythmic and spectral characteristics, independent of the intonation pattern. Thus, the experiments provide new evidence for the phonetic reality of levels of stress unrelated to accent,
as maintained by phonologists. Regarding Kohler's data on the perception of segmental influences on F o , Silverman presents contrary evidence arguing against the notion that segmental microprosodies are obliterated by the intonation contour in which the segments are embedded. Beckman and Edwards's paper also addresses the relationship between intonation and stress, but from the point of view of prosodic constituency below the intonation phrase in English. They ask whether constituents within intonation phrases have overtly marked boundaries in addition to their clearly marked heads. They relate this question to the interpretation of two seemingly contradictory durational effects: the lengthening of syllables at the ends of phrases (indicative of edge-based prosodic constituency) and the shortening of stressed syllables in polysyllabic feet (indicative of head-based prosodic constituency). They present experiments designed to control for the second effect so as to examine the first effect alone. After demonstrating that there is a final lengthening internal to the intonation phrase, they attempt to determine its prosodic domain. Although the results provide no conclusive answer, they do suggest that prosodic constituency internal to the intonation phrase will not be adequately represented by a metrical grid. In the first of three commentaries on this paper, Selkirk reviews the phonological literature on the relationship between syntactic constituency and prosodic edges, and contrasts it to Beckman and Edwards's "prominence-based" theory of prosodic constituency. She points out that this literature yields crosslinguistic generalizations about how syntactic structure is mapped onto prosodic structure. The languages reviewed include some which are unlike English in not having its phonologically-dictated culminative prominence of lexical stress. Selkirk presents experimental evidence of her own for isomorphism in the syntaxto-prosody mapping in one of these - namely, Tokyo Japanese. Beckman and Edwards fail to characterize the domain of final lengthening in English, she says, because they have stated the question entirely in terms of phonological prominences rather than in terms of the edges defined in the mapping from syntax. She reinterprets their results as evidence for syntactically motivated constituents such as the Prosodic Word and the Phonological Phrase proposed in her own earlier tree-based work (e.g. Selkirk 1981, 1984). Fowler offers a different but related criticism of Beckman and Edwards. She points out that in distinguishing intonation phrase boundary effects from final lengthening internal to the phrase, Beckman and Edwards fail to control for syntactic constituency. She also reviews aspects of the literature on polysyllabic stress-foot shortening and stress clash that Beckman and Edwards neglect to consider, and in doing so indirectly raises another potential criticism of the experiments - namely, that if stress feet are bounded by word boundaries, as assumed in many versions of metrical phonology, then Beckman and Edwards have not really controlled for foot size. Finally she suggests an important
alternative interpretation of final lengthening that is skirted completely in the discussion of prominence-based constituents versus syntactic constituents. What if final lengthening is not a grammatical marking of linguistic constituents at all, but rather is a mere phonetic reflection of inertial braking at the boundaries of production units ? Should we then expect these production units to be necessarily phonological constituents ? More generally, how are we to know whether observed regularities indicate evidence of grammatical processes or of physical processes ? Cutler's commentary deals primarily with this last question and suggests possible grounds for answering it in comparisons of regularities in production and perception. Her concern addresses two levels at once: first, whether the phonological constituents proposed by Beckman and Edwards, if they exist at all, arise out of constraints on production or perception, and second, whether strong or weak psychological reality should be ascribed to them. If prosodic constituents can be shown to be actively involved in psychological processes of production (e.g. as a demonstrably necessary unit of planning), then they are psychologically real in the strong sense, but if the duration patterns evident in Beckman and Edwards's data are not necessary in planning or salient in perception, then the prosodic constituents are psychologically real only in the weak sense of being accurate generalizations of the effects. The next group of papers addresses the question of the relationship between phonological representations and phonetic structures more generally. Hertz addresses the problem primarily as the practical question of appropriate tools. Can a computational framework be provided that allows phonologists to implement phonological representations and rules in computer programs that can easily be linked to the synthesis of corresponding phonetic structures ? Her paper describes Delta, a programming language and synthesis rule-writing system that she has been developing for the last five years. She illustrates Delta by applying it to an autosegmental analysis of Bambara tone and a multilinear targets-and-transitions analysis of English formant patterns. While her paper emphasizes the practical advantages of Delta as a research tool, the system can be interpreted as making more far-reaching theoretical claims. The Delta structure makes no distinction between phonetic patterns and phonological patterns; it represents phonetic time functions as streams parallel to and essentially no different from the autosegmental tiers or organizational levels of the phonological representation for an utterance. Similarly, the system's rule framework makes no distinction among processes that relate one phonological level to another, processes that coordinate one phonetic time function to another, and processes that build phonetic structures on the basis of the phonological representation. A literal interpretation of these features of the Delta programming language amounts to the claim that there is an underlying unity between phonological and phonetic patterns which is more essential to the representation than any attested differences and incompatibilities between the two different types of linguistic representation. Directly synchronizing phonological
tiers with phonetic time functions implies something like Sagey's (1986) position that autosegmental association means temporal co-occurrence, with the added insight that phonological association is in this respect no different from the temporal alignment and coordination among various phonetic subsystems. In this respect Delta implies the claims that are made more explicitly by Browman and Goldstein's articulatory phonology (see their paper, this volume). Ohala's paper also claims a fundamental unity between phonological and phonetic representations, but one of a somewhat different nature. Ohala notes that many of the commonly attested phonological patterns used to support notions such as natural classes of sounds or the existence of a finite set of distinctive features can be motivated by phonetic principles. In his principal illustration, Ohala demonstrates that the general preference for anticipatory over perseveratory assimilation rules for place of articulation in stops and nasals is rooted in the fact that the burst out of an articulatory closure is more salient as a cue to place than the formant transitions into the closure. Ohala argues that such phonetic motivations should be made to constrain the phonological description directly, by including in the phonological model the primitives and principles of aerodynamics, of the mapping from vocal tract shape to acoustic pattern, of peripheral auditory processing of the signal, and so on. He suggests that since such constraints are not an inherent or derivable property of autosegmental formalisms, autosegmental representations should simply be replaced by more general physical ones. Pierrehumbert's commentary on Ohala's paper argues for a greater separation of levels. She is sympathetic to the reductionist appeal of purely phonetic explanations for phonological patterns, but points out that a reductionist explanation is no better than the accuracy of the generalizations upon which it is based. In any case, Pierrehumbert argues, one still wants an explicit account of how the linguistic behavior is shaped by the physical and physiological constraints on production and perception. She recalls successful models that have been offered by phoneticians and phonologists, and points out that all of them, including autosegmental representations, share the feature of domain-specific computational explicitness. She also suggests that different domains of description in speech show behavior so complex that the relevant general principles guiding the behavior cannot be discovered without domain-specific formalisms. The last group of papers addresses various aspects of segmental organization and coordination among segmental tiers. Clements's paper examines segmental organization within and across syllables as it relates to the notion of sonority and the phonotactic constraints which refer to it. Clements proposes that the scalar nature of sonority does not mimic a corresponding phonetic scale (unlike, say, the auditory continua corresponding to degrees of vowel height or backness). Rather it is a phonological scale produced by implicational relationships among phonological manner features, by which they can be ordered hierarchically: [-sonorant] -> [-nasal] -> [-approximant] -> [-vocoid] -> [-syllabic] to yield
five discrete degrees of relative sonority. Thus Clements would resolve the attested lack of a single simple phonetic correlate for sonority without abandoning the explanatory power of a universal phonetic basis. He then uses this definition of sonority to propose an account of syllabification in which initial and final demisyllables obey different sonority sequence constraints. Preferred syllables are those in which segment sequences are maximally dispersed with respect to this scale in their initial demisyllables, but minimally in their final ones. Preferred sequences of final and succeeding initial demisyllables are also numerically distinguishable with this scale from less preferred ones. This account leads to a metric for classifying a large variety of exceptional tautosyllabic sequences as more or less complex and hence more or less likely to occur. Fujimura points out that Clements's use of the term demisyllable for a unit governing sonority sequencing constraints among phoneme-sized segments is different from his own original proposal of the demisyllable as the basic segmental unit. In this context, he points out again the two observations that motivated his proposal: first, that the ordering of distinctive features within a demisyllable is redundant to a specification of their "affinity" to the vowel and second, that their production is governed by a set of phonetic rules for articulator timing and interarticulator coordination which differ fundamentally from the mere concatenation and assimilation of feature bundles across the demisyllable boundaries. He suggests that phonetic vowel affinity and articulatory coordination provide a more explanatory account of the syllable than can the phonological sequencing constraints in traditional accounts of sonority. Once examples of phonetic affixation are excluded, he says, Clements's exceptional cases do not violate our intuitions about sonority for events involving any single articulator. The issue of coordination among different articulations is addressed also by the other papers in this section. The first of these is Browman and Goldstein's introduction to their dynamical model of articulation. They propose that segmental features in production consist of dynamically-defined articulatory gestures arranged on physical tiers. At one level of abstraction, the gestures are discrete, invariant configurations for articulators that are functionally yoked, such as the lips and jaw in bilabial closures, where the configurations are defined by specifying the dynamic parameters of stiffness and target displacement in a spring-mass model. At a less abstract level of description, these discrete gestures determine the movement traces for single articulators. Coordination among the different tiers is described by specifying phase relationships among the periods for the articulatory gestures. Browman and Goldstein first describe some phasing rules for overlapping vowel and target gestures and then apply the model to provide a unified account of the apparent elisions, assimilations, and substitutions characteristic of casual speech. In their dynamic representation of segments, these otherwise unrelated and unmotivated consequences of casual speech all fall out as a physically homogeneous group of adjustments of the gestures' timing or
magnitude which alter the acoustic output without requiring actual loss or substitution of gestures. Thus, the model provides an elegant account of the casual speech phenomena as phonetic processes rather than as grammatical changes in the phonological representation. Fujimura points out that the quantitative description of phonological segments as time-varying elementary gestures for different articulators can be accomplished in several ways besides specifying dynamic parameters such as stiffness. Therefore, a given model cannot be evaluated just by showing that it is adequate for describing the local temporal structure of single events or pairs of contiguous events. It must also be tested by determining how well it predicts the occurrence of a gesture within the larger temporal organization, including the realization of prosodic structure such as phonological word or phrase boundaries. While he questions the necessity of adopting a task dynamic model for this purpose, he commends Browman and Goldstein's explicit quantitative formalism and especially their application of the formalism to provide an explanatory account of otherwise ad hoc segmental patterns in casual speech. He expresses the hope that this account will motivate phonologists to seriously consider articulatory data and to heed the insights of quantitative models such as theirs in formulating more abstract representations. Steriade's comments might be considered a fulfilment of Fujimura's hope. She contrasts the infinitely many degrees of overlap among gestures that can be represented in Browman and Goldstein's gestural score with the three degrees of overlap provided by standard autosegmental representations, which can distinguish only among a long autosegment (linked to two CV timing slots), a short autosegment (linked to one timing slot), and an extra-short autosegment (half of a contour specification linked to a single timing slot). She argues that the more conventional autosegmental representation is necessary to explain why only these three degrees of length are ever distinguished phonologically, but then suggests that something more like Browman and Goldstein's gestural score can better represent a class of apparent vowel epenthesis processes found in Winnebago, Romance, and Slavic, as shifting of a consonantal gesture with respect to the vocalic gesture of its syllable. She describes how a variable phase-angle specification of consonantal gestures with respect to vocalic ones might be incorporated into a feature-geometry account of these processes, although she notes that the latter does not reveal why these processes work just the way they do nearly as well as the gestural account does. By contrast, Ladefoged's review of the papers by Clements and by Browman and Goldstein argues that Fujimura's hope is misguided. Ladefoged reminds us that the aims of phonetics and phonology are different and often contradictory. Phonetic models such as Browman and Goldstein's aim to describe in predictive detail the mappings between different physical, physiological, or perceptual functions in the production or perception of specific utterances. Phonological
models such as Clements's on the other hand, must capture generalizations across languages in more abstract patterns of "speech sounds." The descriptive primitives necessary to these two different tasks do not translate directly into each other. Ladefoged shows that vowel features that are useful for predicting articulatory structure, for example, cannot be used to describe phonological processes such as umlaut or vowel shift. Thus, attempts to relate phonetic parameters directly to phonological features may be misguided. This position contrasts to the spirit of Steriade's comments, and is diametrically opposite to the relationship between phonetics and phonology assumed in the papers by Hertz and Ohala. Browman and Goldstein's model provides one possible answer to the two most pressing questions raised by autosegmental phonology: namely, the nature of the specification (and possible underspecification) of elements on each different autosegmental tier and the principles of coordination among the different tiers. Each of the last two papers in this group proposes an alternative answer to one of these two questions. Kingston attempts to account for two strong cross-linguistic asymmetries in the contrastive use and realization of glottal articulations - the markedly greater frequency of contrastive glottal articulations in stops than in continuants and the manifestation of these articulations at the release rather than at the closure of the stop - in terms of a functional principle governing the coordination between articulations specified on the laryngeal and supralaryngeal tiers in consonants. He proposes that these asymmetries arise from differences between stops and continuants in the "tightness" of "binding" between gestures on the two tiers. (Here, tight or loose binding is a metaphor for narrower or larger ranges of variation in timing relative to the other tier.) The binding principle specifies that the glottal gesture is more tightly bound to the salient acoustic effect of the supraglottal gesture, the burst at the stop release, for stops than it is for continuants. Tighter binding to release in stops ensures that manipulations of the airstream have the desired acoustic result and allows contrasts between different glottal articulations to be reliably conveyed. The specific prediction of the binding principle that glottal articulations will bind preferentially to a stop's release rather than its closure is tested by examining oral-glottal coordination in Icelandic preaspirated stops, a rare type of stop in which the opening of the glottis precedes the stop closure and thus may bind to it. Ohala emphasizes the functional motivation of the binding principle and suggests that, in general, binding must be relatively tight for all segments in order to fulfil the perceptual requirements of speech. Variation in the tightness of binding in Ohala's view would be a reflection of how long or short a stretch of the signal is required to convey the identifying properties of a segment. Goldstein points out several problems and counterexamples to the binding principle as stated. He suggests that these counterexamples can be better accounted for if the greater tightness of binding in stops is understood not as a
requirement for their successful production, but rather in terms of the greater number of perceptual distinctions available. That is, different patterns of coordination produce spectrally distinct events in the case of stops, which languages can use to distinguish different stop types, whereas comparable differences in coordination for continuant segments do not produce comparable acoustic contrasts. Goldstein also observes that other principles of coordination which predict quite different schedules for glottal and oral articulations also appear to operate in obstruents, occurring singly or in clusters. Keating's paper addresses the other question raised by autosegmental phonology - namely, the nature of intra-tier coordination. She proposes an account of feature specification and gestural dynamics that is directly in the tradition of phonetic segments as a linear sequence of feature specification bundles, but offers a different formulation of the feature specification which allows her to avoid characterizing coarticulation as an entropic process. As in other, older segmental models of coarticulation, a phonetic feature is represented in terms of target values along some physical scale (e.g. the position of an articulator along some spatial dimension). The novelty of Keating's model is that the target is not a single point, but instead is a more or less narrow range - a window - of points. As in the traditional models, segmental dynamics are modeled as an interpolation among the successive targets. Because the targets are windows rather than single points, however, coarticulatory variation need not be modeled as "undershoot" or artifactual "smearing" due to physiological inertia. Rather, it is a potentially informative computation of an optimal trajectory through the windows of successive segments. Stevens interprets Keating's proposal in terms of aerodynamic, acoustic, and perceptual constraints that dictate greater or lesser articulatory precision. He shows that if a speaker is to achieve a particular acoustic goal, in certain instances his articulation must at times exhibit greater precision within a segment than Keating's model would predict, in order to ensure the necessary acoustic discontinuities which allow the listener's task of dividing the signal into segments and then identifying them to be successful. In the transition to and from a segment, also, the necessary level of precision may vary for reasons other than the computation of an optimal path through its window in that context. In Fowler's commentary on Keating, she again raises the general question put forward in her reaction to Beckman and Edwards's paper. Should regularities in production necessarily be attributed to grammatical structures? More generally, how should we understand the nature of phonetic and phonological representations and their relationship to each other? She argues in more detail for her earlier position that phonetic regularities cannot be assumed to reflect grammatical structure. Responding to Keating's proposal, she suggests that it is a mistake to describe the patterns of greater or lesser articulatory precision in terms of such formal devices as feature specifications.
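Keating's window idea lends itself to a simple computational illustration. The sketch below is ours, not an implementation from her paper: each segment contributes a permissible range (a window) on a single articulatory parameter, and a trajectory is smoothed through the successive windows. The particular window values, the relaxation scheme, and the function names are invented for illustration only.

```python
# Illustrative sketch of a "window" model of coarticulation (not Keating's own
# implementation): each segment specifies a permissible range (window) on some
# articulatory parameter, and the phonetic trajectory is a smooth path that
# stays inside the successive windows. All numbers here are invented.

def window_trajectory(windows, points_per_segment=10, passes=50):
    """Pick a path through successive (lo, hi) windows by starting from each
    window's midpoint and repeatedly smoothing, clipping back into the windows."""
    path, bounds = [], []
    for lo, hi in windows:
        for _ in range(points_per_segment):
            path.append((lo + hi) / 2.0)   # start at the window midpoint
            bounds.append((lo, hi))
    # Relax toward a smooth curve while respecting each sample's window.
    for _ in range(passes):
        for i in range(1, len(path) - 1):
            smoothed = (path[i - 1] + path[i + 1]) / 2.0
            lo, hi = bounds[i]
            path[i] = min(max(smoothed, lo), hi)   # clip back into the window
    return path

# A narrow window (a precise target) flanked by wide windows (segments that
# tolerate extensive coarticulation): the wide-window segments bend toward
# the narrow one without any separate "undershoot" mechanism.
trajectory = window_trajectory([(0.0, 1.0), (0.8, 0.9), (0.0, 1.0)])
print([round(v, 2) for v in trajectory])
```

The point of the sketch is only that the narrow window, rather than a fixed target point, is what constrains the path; everything else about it is an arbitrary smoothing choice.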
The papers in this volume thus represent a wide range of views on the issue of the relationship between phonology and phonetics. We trust that they also reflect the excitement and congenial argumentation that characterized the conference. And we hope that they will spark further inquiry into and discussion about topics in laboratory phonology.

References

Beckman, M. E. and J. B. Pierrehumbert. 1986. Intonational structure in Japanese and English. Phonology Yearbook 3: 255-310.
Clements, G. N. 1981. The hierarchical representation of tone features. Harvard Studies in Phonology 2: 50-115.
Cooper, W. E. and J. M. Sorenson. 1981. Fundamental Frequency in Sentence Production. New York: Springer-Verlag.
Fant, G. 1960. The Acoustic Theory of Speech Production. The Hague: Mouton.
Liberman, M. Y. and J. B. Pierrehumbert. 1984. Intonational invariance under changes in pitch range and length. In M. Aronoff and R. T. Oehrle (eds.) Language Sound Structure: Studies in Phonology Presented to Morris Halle. Cambridge, MA: MIT Press.
Lieberman, P. 1967. Intonation, Perception, and Language. Cambridge, MA: MIT Press.
McCawley, J. D. 1968. The Phonological Component of a Grammar of Japanese. The Hague: Mouton.
Pierrehumbert, J. B. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT.
Pierrehumbert, J. B. and M. E. Beckman. 1988. Japanese Tone Structure. Cambridge, MA: MIT Press.
Poser, W. J. 1984. The phonetics and phonology of tone and intonation in Japanese. Ph.D. dissertation, MIT.
Sagey, E. 1986. The representation of features and relations in nonlinear phonology. Ph.D. dissertation, MIT.
Sapir, E. 1925. Sound patterns in language. Language 1: 37-51.
Selkirk, E. O. 1981. On the nature of phonological representation. In T. Myers, J. Laver, and J. Anderson (eds.) The Cognitive Representation of Speech. Amsterdam: North-Holland, 379-388.
Selkirk, E. O. 1984. Phonology and Syntax. Cambridge, MA: MIT Press.
Trubetzkoy, N. S. 1939. Grundzüge der Phonologie. Trans. C. A. M. Baltaxe, 1969. Principles of Phonology. Berkeley: University of California Press.
2 Where phonology and phonetics intersect: the case of Hausa intonation SHARON INKELAS AND WILLIAM R. LEBEN
1 Introduction
Recent studies have raised a number of questions about the nature and role of phonological representations in the analysis of intonation. One area of uncertainty involves the division of labor between the phonological component and phonetic implementation. Hausa shows that there is a phonology of intonation separate from the phonetics; attempts to frame the phonological generalizations in purely phonetic terms lead to loss of explanatory force. Treating intonational features such as downdrift and 'key raising' (Newman and Newman 1981) as phonological not only explains their distribution more effectively but also reduces the size, and perhaps the complexity, of the phonetic implementation component. A second issue involves the nature of prosodic constituents, i.e. phonological domains within which phonological rules apply. Hausa offers evidence of intonational phrases, prosodic constituents which, although they are constrained by syntactic constituent structure, do not mirror it exactly. Third, our work addresses the representation of tone features. This is still a subject of controversy in the description of lexical tone. And when we look at the representation of intonational melodies which are superimposed on lexical tone, the problem is compounded even further. In Hausa all syllables are specified for either a High or a Low tone by the time rules of intonation apply. What sorts of features do phonological rules of intonation insert, and where in the representation are they inserted?
2 Prosodic constituency

We will address the question of constituency first. We have argued in the past (Leben et al. forthcoming; Inkelas et al. 1987) for the intonational phrase in Hausa on the basis of a number of phonological and phonetic effects that are bounded by phrases. A few of these are illustrated in (1) (Low tone syllables are marked in examples with a grave accent, while High tone syllables are unmarked). Both lines represent averages of ten tokens, all from the same speaker.1 The bottom line corresponds to the sentence uttered in normal declarative intonation;
(1) [Averaged F0 contours (ten tokens each, one point per syllable, Hz scale) for Maalam Nuhu / yaa hana lawan / hawaa wurin raanaa, with H and L tone labels marked per syllable; the bottom curve is the declarative rendition and the top curve the Yes/No question.]
the top line is the same sentence uttered as a Yes/No question. Vertical lines indicate phrase boundaries. The first phrase boundary is manifested by the interruption of the downward trending contour at the beginning of the second phrase; the second phrase boundary shows up most strikingly in the question, where it delimits the domain in which High tones are raised above the level at which they would appear in a declarative, even when we abstract away from the more general phenomenon which causes questions to be uttered in an overall higher register than declaratives. This latter effect, which we have termed Global Raising, is easily detected in (1). Other more subtle effects of phrasing involve the application of Low Raising, an assimilatory process which raises the value of a Low above what it would be if no High tone followed, and which can even make this Low higher than the following High. Low Raising applies within the first phrase of (1), as indicated by the arrow. (Although the example depicts the result of averaging over ten tokens, the same is true of the individual tokens.) But it does not apply across the first phrase boundary, where its environment is otherwise met. Similarly, the rule of High Raising, which raises a High tone before Low, applies to the first HL sequence within the second phrase of (1) but not, according to other data of ours, across phrase boundaries.

We have not investigated phrasing in Hausa systematically yet, but a number of observations suggest that syntax-based prosodic constituent accounts like those of Selkirk (1980, 1984, 1986), Nespor and Vogel (1982, 1986), Hayes (1984), and others are appropriate for Hausa. The theory of the prosodic hierarchy, within which the accounts just mentioned are articulated, posits a hierarchy of prosodic constituents, ranging in size from the syllable and foot all the way up to the intonational phrase and utterance-sized domains. Intonational phrases, as described by Selkirk and by Nespor and Vogel, typically contain a syntactic constituent or constituents which form a coherent semantic unit.
The Hausa phrases we have described fall into this pattern, as the parsing of a Hausa utterance into intonational phrases is constrained to some extent by the syntax. For example, there is typically a phrase boundary between the two objects of double object constructions. The Hausa version of "Audu described to the old man that new car of Mani's" has essentially the same constituent structure as in English: NP, VP, where VP consists of verb, indirect object, direct object. We have noticed in particular that the final syntactic constituent, the direct object, tends to constitute its own intonational phrase. A similar case of the syntax dictating the location of phonological phrase boundaries involves cleft constructions. In the Hausa sentence Lawan nee ya baa mu maamaaki "It's Lawan who surprised us," the initial constituent Lawan nee "It's Lawan" obligatorily forms a phrase by itself, acting as a domain for the phrasal rules we have mentioned.

But intonational phrases - or prosodic constituents in general - do not necessarily match syntactic constituents.2 This is very easily demonstrated by the facts of emphasis in Hausa. We have argued elsewhere (Inkelas et al. 1987) that emphasized words in Hausa always begin a new phrase. Emphasis, which is accomplished by raising the first High tone in the emphasized word, regularly interrupts the downdrift contour. This, coupled with the inability of the phrasal Low Raising rule to apply between an emphasized word and what precedes it, points to the presence of a phrase boundary. Emphasis can be added to most positions in a Hausa utterance. And if every emphasized word begins a new phrase, it follows that almost any sentence can be parsed differently into intonational phrases depending on which word is emphasized. Yet the syntactic constituent structure does not vary along with the prosodic structure.
3 Tone features

We now turn to the second major area of interest, namely tone features and where they fit into a hierarchical representation of phonological features. Some intonational phenomena in Hausa can be described by the simple addition of a tone to the primary tier,3 that is to say, the tier on which is represented the lexical contrast between High and Low tone. An example is the Low tone optionally added to the end of Yes/No questions (Newman and Newman 1981) which, when added to a word that ends in a High tone, neutralizes the contrast between that word and one ending in a falling (HL) tone. As illustrated schematically in (2) below, in declaratives the word kai "you" contrasts lexically with kai "head." The one on the left has a High, while the one on the right has a fall, a High-Low sequence. When question intonation introduces a Low tone, this Low makes the tone pattern of kai "you" indistinguishable from the High-Low of kai "head" (Newman and Newman 1981).
(2)
                       Taare da kai        Taare da kai
                       "with you"          "with a head"

       Declarative:    ... H               ... H L
       Interrogative:  ... H L             ... H L
But as shown in Inkelas et al. (1987), it is impossible to describe all of Hausa intonation simply by adding tones to the primary tier. Yes/No questions also incorporate what Newman and Newman (1981) refer to as "key raising," a systematic upward shift in High and Low tone. Under key raising, a phrase-final Low tone is raised above the level of a non-final Low in questions, and this cannot be captured on the primary tier. As we argue below, part of the Hausa question morpheme is a High tone added to a secondary register tier, which raises the value of the final primary Low. (3)
    Question morpheme:

        L        Primary tier
        |
        o        Tonal node tier
        |
        H        Register tier
Register High tones trigger an upward shift in the entire pitch register for the primary tones affiliated with them, a proposal identical in spirit to that of Hyman (1985, 1986),4 who introduces into the phonological representation register Low tone as the method for handling the systematic lowering of the pitch register in which certain High tones - that is, downstepped Highs - are realized. Hyman proposes to represent these downstepped Highs as primary High tones which are associated with a register Low, as in the following example. (We have added a tonal node in conformity with the other representations in this paper.) (4)
    Downstepped High:

        H        Primary tier
        |
        o        Tonal node tier
        |
        L        Register tier
Later in the paper we will motivate the use of register Low for Hausa as well; like the primary tone tier, the register tier is thus home to both High and Low tones, which modify the realization of primary tone features by raising or lowering the register in which those tones are realized.
Following Archangeli and Pulleyblank (1986), Hyman and Pulleyblank (1987) and Inkelas (1987a, 1987b), we link both the primary and register tone tiers to an intermediate tier containing tonal nodes, thereby encoding the correspondence between primary and register tone. The further correspondence between particular tonal configurations and syllables can be found in the mapping between tonal node and syllable tiers. The geometry of this representation may not be transparent from the displays below. What we intend is for the two tone tiers (primary and register) to be parallel lines within a plane - which also contains the tonal node. That plane is orthogonal to the association line connecting the tonal node to the syllable node. (5)
        T        Primary tone tier
        |
    σ — o        Tonal node tier
        |
        T        Register tone tier
In this model the question morpheme of (3), when linked to a syllable, obtains the following representation:
(6)
        L        Primary tone tier
        |
    σ — o        Tonal node tier
        |
        H        Register tone tier
By positing two different binary-valued tone features we will be able to express what needs to be expressed, namely that at the intonational level Hausa contrasts not only primary High tone with primary Low, but also raised with lowered versions of both tones. The particular two-tiered proposal we have made for Hausa is not to be confused with an approach recently devised by Pulleyblank (1986) to handle three-tone languages like Yoruba.5 Pulleyblank's features, [upper] and [raised] (p. 131), differ in two ways from the features we use. First, neither [upper] nor [raised] is interpreted as triggering a shift in the register, the function of our register feature. Instead, Pulleyblank's features carve out a four-way partitioning of a fixed pitch register. This system taken by itself is incapable of handling potentially unbounded sequences of register shifts, such as successive downsteps or for that matter downdrift, a characteristic of Hausa described in the next section. In order to capture these phenomena, Pulleyblank must add a new device, the tonal foot (p. 27), which appears to have no independent motivation and which, added to his tonal features, makes it possible in principle to describe lexical tonal systems with four distinct levels, each of which can have a downstepped counterpart. As far as we know, systems of this sort are unattested, and our two-tiered approach captures this fact nicely. Although Pulleyblank's system is thus richer than appears to be required by
natural languages, it still fails to provide a mechanism for representing at least one phenomenon which we do find in a number of tone languages, namely floating tone melodies. This difficulty arises because Pulleyblank does not analyze the lexical opposition between High and Low tone in terms of a single binary contrast in one feature, e.g. [+upper] vs. [-upper]. Instead, the opposition is between [+upper] and [-raised]. By way of illustration, (7) contains a HL lexical tone pattern, represented in each of the two models. (7)
    With our features:                With Pulleyblank's features:

    H   L      primary tier           [+upper]     upper tier
                                      [-raised]    raised tier
Pulleyblank's system would fail to provide distinct representations for the floating tone melodies HL and LH since, to our knowledge, autosegmental phonology provides no way of sequencing a floating feature on one tier with respect to a floating feature on a different, independent tier. Yet Hausa morphology requires just such sequences, as Newman (1986) and Leben (1986) have shown.
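To make the geometry of (5)-(6) concrete, here is one possible encoding in Python. The class names, and the idea of keeping a floating melody as an ordered list of tonal nodes, are illustrative choices for this sketch, not notation from the paper.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class TonalNode:
        primary: Optional[str] = None     # 'H' or 'L' on the primary tier
        register: Optional[str] = None    # 'H' or 'L' on the register tier

    @dataclass
    class Syllable:
        text: str
        node: Optional[TonalNode] = None  # association line to a tonal node

    # The question morpheme of (3): a primary Low whose node also bears a register High.
    question_morpheme = TonalNode(primary="L", register="H")

    # A downstepped High as in (4): a primary High realized on a lowered register.
    downstepped_high = TonalNode(primary="H", register="L")

    # Because H and L are both tones on the single primary tier, a floating HL melody
    # is simply an ordered list of nodes, distinct from a floating LH melody - the kind
    # of ordering the text argues Pulleyblank's two independent tiers cannot express.
    floating_HL: List[TonalNode] = [TonalNode(primary="H"), TonalNode(primary="L")]
    floating_LH: List[TonalNode] = [TonalNode(primary="L"), TonalNode(primary="H")]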
4 Phonetics/phonology
We now turn to a third area of theoretical interest, namely the division of labor between phonetics and phonology. As shown in Inkelas et al. (1987), Hausa utterances are characterized by downdrift. By downdrift we mean the phenomenon whereby a primary Low tone triggers the lowering of the pitch register in which all following tones in the phrase are realized. This effect, which was also illustrated in (1), stands out in (8).
(8) [F0 contour, one point per syllable, of the declarative Yaa aikaa wa Maanii / laabaarin wannan yaaron alaramma, with tone labels H H L L H H L H L H L H L H L H L; the vertical line marks the phrase boundary.]
Note that in the second phrase (to the right of the vertical line) each H following an L is lower than the H preceding the L. In Yes/No questions, by contrast, downdrift is suspended; as (9) shows, Hs following Ls are no lower than those preceding them in the final phrase of such questions.
(9) [F0 contour of the same sentence uttered as a Yes/No question, Yaa aikaa wa Maanii / laabaarin wannan yaaron alaramma?, with tone labels H H L L H H L H L H L H L H L H H.]
The graph in (9) depicts the very same utterance shown in (8), the only difference being that (9) is uttered as a question,6 and we see that in the final phrase of the question, the second High of each HLH sequence is no lower than the first. In fact, all but the last High in an alternating sequence of Lows and Highs are realized at approximately the same level.7 This lack of downdrift is correlated with another effect which causes each High tone in the final phrase of a question to become higher than it would be in the corresponding declarative. There is no logical reason why either of these facts should entail the other. In fact, we show in Inkelas et al. (1987) that with a different type of raising, Global Raising (see (1)), downdrift is quite permissible. Thus in theory it would be possible for the final phrase of questions to be raised in Fo with respect to the preceding phrase, yet with downdrift still applying across the phrase. But this is not what happens. Since phonetic implementation rules must permit downdrift to co-occur with Global Raising, they will have a hard time explaining the observed complementarity between downdrift and question raising. Suppose instead that we hypothesize that downdrift and question raising arise out of opposite specifications for a single phonological parameter, namely a register tone. In this case, the mutual incompatibility between downdrift and question raising will not be an accident. In our system, primary tones can be pronounced on one of two registers, Low or High. We assume that the Low register is neutral or unmarked and that High is marked. The rule for raising each primary High of the final phrase of questions is, simply: (10)
    Register High Insertion:

        H              H        Primary tier
        |              |
        o      →       o        tonal node
                       |
                       H        Register tier
Downdrift, on the other hand, is triggered by the insertion of a register Low. The fact that primary Low is the phonological source of downdrift is captured by (11), which inserts register Low onto tonal nodes linked to primary Low. (11)
    Register Low Insertion:

        L              L        Primary tier
        |              |
        o      →       o        tonal node
                       |
                       L        Register tier
The fact that the effects of downdrift pass from primary Low to the following High is captured by the spreading of register Low in (12). (12)
    Downdrift:

        L    H        Primary tier
        |    |
        o    o        tonal nodes
         \  /
          L           Register tier   (the register Low spreads from the Low's node to the following node)
The rules convert a representation like that of Ban sanii ba "I don't know" in (13) into (14). Because our representation is multidimensional, each step of the derivation has been depicted on two planes, one containing the syllables and tonal nodes, and the other containing the tonal nodes and tones.
(13)-(14) [Representations of Ban sanii ba "I don't know" before and after the application of rules (11) and (12), each depicted on two planes: one containing the syllables and tonal nodes, the other the tonal nodes and tones.]
The important result of this account is that as long as we assume that no single tonal node can be linked simultaneously to both a High and a Low register tone, we can predict complementarity between High and Low register - and capture the fact that question raising and downdrift cannot co-occur. The other intonational effects to which we now turn provide additional motivation for our phonological proposal.
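The following is a minimal procedural sketch of how rules (10)-(12) and a crude phonetic interpretation of them might be implemented. It is an illustration only: the one-node-per-syllable representation, the function names, and all numeric values (register top, step size, raising factor) are assumptions for the sketch, not anything proposed in the paper.

    def intonation_rules(primary, is_question_final_phrase=False):
        """primary: list of 'H'/'L' primary tones for one intonational phrase.
        Returns a parallel list of register tones ('H', 'L', or None)."""
        register = [None] * len(primary)
        if is_question_final_phrase:
            # (10) Register High Insertion: every primary High in the final
            # phrase of a Yes/No question receives a register High.
            for i, tone in enumerate(primary):
                if tone == "H":
                    register[i] = "H"
        # (11) Register Low Insertion: register Low is inserted on tonal nodes
        # linked to primary Low (a node bears at most one register tone).
        for i, tone in enumerate(primary):
            if tone == "L" and register[i] is None:
                register[i] = "L"
        # (12) Downdrift: register Low spreads rightward to a following node that
        # has no register tone of its own. Nodes already bearing register High are
        # skipped, so downdrift and question raising cannot co-occur on one node.
        for i in range(len(primary) - 1):
            if register[i] == "L" and register[i + 1] is None:
                register[i + 1] = "L"
        return register

    def f0_targets(primary, register, h_top=160.0, l_drop=30.0,
                   step=0.9, raise_factor=1.15):
        """Crude phonetic interpretation: register Low lowers the register by one
        step per primary Low; register High raises targets instead."""
        level, targets = 0, []
        for p, r in zip(primary, register):
            if p == "L" and r == "L":
                level += 1                     # a primary Low steps the register down
            if r == "H":
                top = h_top * raise_factor     # question raising (also emphasis, ideophones)
            elif r == "L":
                top = h_top * (step ** level)  # downdrifted register
            else:
                top = h_top                    # default register
            targets.append(round(top if p == "H" else top - l_drop, 1))
        return targets

    phrase = list("HHLLHHLHL")   # a schematic H/L phrase, not the actual tone string of (8)
    print(f0_targets(phrase, intonation_rules(phrase)))         # declarative: each H after an L is lower
    print(f0_targets(phrase, intonation_rules(phrase, True)))   # question: Hs level and raised

In the question case the Highs come out level and raised, and the suspension of downdrift falls out of nothing more than the ban on a node bearing both register tones, which is the complementarity argued for in the text.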
5 Emphasis

One way of expressing emphasis in Hausa is to raise the first primary High tone in the emphasized phrase. An example is the Fo contour of the utterance Ba mu neemoo Maanii ba! "We didn't go look for Mani!", where the verb neemoo is emphasized. The sentence has a LLHHHHH tone pattern. The large jump up between the last Low and first High is much greater than it would be in an unemphasized rendition of this sentence. As before, one point per syllable has been represented in the illustration.
(15) [F0 contour (one point per syllable, roughly 120-190 Hz) of Ba mu neemoo Maanii ba! with emphasis on neemoo; tone pattern L L H H H H H.]
The sentence illustrated in (16) is similar, except that the emphasized High-toned word does not contain the initial High of the High sequence. In Bai biyaa Waali kudii ba "He didn't pay Wali money," emphasis occurs on Waali, which follows the High-toned verb biyaa "pay." We see a step up from Low to regular, downdrifted High and then a further jump up to the raised High of the emphasized word Waali.
(16) [F0 contour (roughly 120-180 Hz) of Bai biyaa Waali kudii ba! with emphasis on Waali.]
We need to account for the fact that emphasis will raise the first High in the emphasized string, along with any adjacent High to the right in the same phrase. Suppose that we stipulate that a phrase boundary is inserted just before the emphasized word and that within the phrase the Obligatory Contour Principle (Leben 1978) causes any string of syllables with the same tone to be linked to a single tonal node dominating that tone. (17)
Let us assume further that raising for emphasis employs a register High tone. Then once this High tone attaches to a given primary High, all of the adjacent High-toned syllables will be raised as well, in accordance with our observations. The resulting phonological representation of Bai biyaa Waali kudii ba! is given below. (18)
(18) [Phonological representation of Bai biyaa Waali kudii ba!: (a) input to intonation rules, with the adjacent primary Highs sharing a single tonal node; (b) output, with a register High linked to that node.]
Now consider a case where the emphasized string is in a phrase containing a lexical Low. We would expect that after this Low the register will return to normal, since nothing in what we have proposed so far would lead the register High of emphasis to be assigned to primary Low. Thus, we would predict that primary Low will undergo the ordinary downdrift rule described above, and that this downdrifting process will go on to lower any subsequent High tone in the phrase.
This is exactly what happens. In eliciting data on sentences with this tone pattern, we explicitly asked speakers to drop down to a non-raised register following the emphasized word. For instance, in the example in (18) we asked them to lower the register after Waali. Speakers uniformly rejected this. But where the emphasized word was Audu, which has a HL melody, speakers were able to drop to the non-raised register for the rest of the phrase. Compare (16) above with (19a) below. The phonological representation for the latter utterance is given in (19b). (19)
a. [F0 contour of Bai biyaa Audu kudii ba! (tones L H H H L H H H, roughly 100-220 Hz), with emphasis on Audu.]
b.-c. [Input to and output of the intonation rules for this utterance.]
6 Ideophones
One additional case of raising involves High-toned ideophones. Certain emphatic particles in Hausa are always pronounced with an extra-High tone. Take for example the ideophone kalau meaning "extremely, very well." kalau is lexically High, meaning that both of its syllables are linked to a tonal node which is itself
linked to a primary High. What makes it an ideophone is that it is also obligatorily linked to a register High: (20)
(20)
                   H         Primary tier
                   |
    ka + lau ————— o         tonal node (both syllables of kalau share one node)
                   |
                   H         Register tier
We have already argued that in questions even regular Highs in the final phrase get register High. So if we are right about the OCP, and about register High accomplishing question raising as well as ideophone raising, we predict a neutralization between a regular High tone and a High ideophone in the final phrase of questions. Our data support this prediction. We performed an experiment in which we compared two Yes/No questions each containing two High-toned words in the final phrase. In one case the last word was the High ideophone kalau; in the other it was the non-ideophone High-toned noun kawaa "sister." (21)
    Yaa   saami   awarwaroo   kalau
    He    got     bracelet    well

(22)
    Yaa   saami   awarwaron    kawaa
    He    got     bracelet-of  sister
(23) depicts the derivation of (21), the sentence with the ideophone. In (23b) we see the register High of question raising inserted on the High preceding the fully specified High ideophone; the resulting identical tonal configurations merge by the OCP in (23c) to produce a representation identical to that of the sentence without the ideophone. The derivation of the latter appears in (24).
(23) [Derivation of (21) Yaa saami awarwaroo kalau?:
    a. Input to intonation rules: the primary Highs are linked to tonal nodes, and the ideophone kalau is lexically linked to a register High.
    b. Register High Insertion: the register High of question raising is inserted on the High preceding the ideophone.
    c. OCP: the resulting identical tonal configurations merge, yielding a representation identical to that of the sentence without the ideophone.]

(24) [Derivation of (22) Yaa saami awarwaron kawaa?:
    a. Input to intonation rules.
    b. Register High Insertion applies to the final-phrase Highs.]
The tones of the two utterances are virtually identical phonetically as well, as the following data demonstrate. Each row of the table in (25) represents the mean and variance of five tokens produced by the same speaker of one of the two utterances. One point was measured per syllable: the peak of High syllables and the valley of Low syllables. The data were recorded in three passes. The first two passes are represented in the first two rows of each half of the table; on the third
pass the speaker was asked to substitute the syllable ma for each syllable in the original sentences. This was done to factor out possible segmental effects on Fo. (25)
[Table: mean and variance of F0 (Hz), measured at one point per syllable, for the questions Yaa saami awarwaroo kalau? and Yaa saami awarwaron kawaa?, over the first, second, and reiterant (ma-substituted) passes.]
Statistical tests showed no significant difference in the difference in Fo between the final and penultimate syllables across the kalau and kawaa sentences; nor was there a significant difference in the percentage of increase in Fo across those two syllables according to whether the final word was an ideophone or not.8 We thus confirm a prediction of our analysis, namely that since only one device, the register High, is responsible for non-gradient phonological raising, then whether we have one factor or simultaneously more than one factor conditioning raising, the amount of raising should not vary. It is not clear that we could make this same prediction with potentially additive phonetic implementation rules. Moreover, we have shown that raising caused by register High tones is in complementary distribution with the lowering of register Lows, another regularity which we would not necessarily expect of phonetic rules. One effect we have mentioned, Global Raising, co-occurs with downdrift as well as with register raising, and thus must not result from the imposition of a register High tone. We thus hypothesize that Global Raising results from the application of a phonetic implementation rule.
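A minimal sketch of the kind of paired comparison described in note 8 follows, with invented numbers standing in for the authors' measurements; scipy's paired t test is used purely for illustration.

    from scipy.stats import ttest_rel

    # Hypothetical Fo rises (Hz) from the penultimate to the final syllable,
    # one value per matched pair of kalau (ideophone) and kawaa (plain noun) tokens.
    rise_kalau = [62.0, 55.5, 70.1, 58.3, 64.2]
    rise_kawaa = [60.4, 57.9, 66.8, 61.0, 62.5]

    t_stat, p_value = ttest_rel(rise_kalau, rise_kawaa)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    # A p value above 0.05 would mean no significant difference in the rise,
    # which is the result the paper reports for both the absolute rise and the ratio.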
7 The phonology of register

Before concluding our discussion, let us briefly consider how one might account for the facts of question raising, emphasis, and ideophones in a model which does not incorporate the register tier. One obvious possibility is to use a special tone on the primary tier, along the lines of the phrasal High proposed for English by Pierrehumbert (1980) and for Japanese by Pierrehumbert and Beckman (1988). Such a tone, inserted into the primary tier, would signal that the phonetic rules should raise the register of its fellow primary tones. But consider the tasks such a tone would have to perform in Hausa. In the preceding sections, we have identified three possible types of phonological raising which can apply to a HLH primary tone sequence belonging to a single phrase. As (26) illustrates, only the first of these Highs can be raised, as is the case under emphasis; or only the last can be raised, if it is a High ideophone. The third possibility is for both Highs to be raised, and this is exactly the situation we find in the final phrase of Yes/No questions. (26)
    a. First High is raised:   H L H    Audu nee?     "It is Audu?"
    b. Last High is raised:    H L H    daa jaa wur   "very red son"
    c. Both Highs are raised:  H L H    Audu nee?     "Is it Audu?"
One can imagine that placing a phrasal High tone at the left of the phrase might cause the first primary High in the phrase to be raised, or that placing a phrasal High at the right edge would produce a similar effect on the final primary High. But we see no way for a phrasal High to perform either one of these types of local raising and still serve as the trigger for question raising, in which every High in the phrase is affected. We would be forced to posit another phrasal tone for this latter effect, adding unnecessary tools to our phonological inventory - and losing the generalization captured in the register approach that the raised Highs in all types of raising are phonologically identical.
Conclusion

A tonal system containing only one binary-valued tone feature for which all tone-bearing units are specified is clearly too impoverished to describe the range of phonological intonational effects we have observed in Hausa. For example, no single binary-valued feature could capture the four-way phonological contrast observed in (26) among raised High, lowered High, raised Low and lowered Low tones. It is worth noting here, however, that to describe multiple tonal contrasts of this sort in other languages, paths different from the one we took in this paper have been proposed. One alternative route that has been taken in the literature is the
use of a single binary-valued phonological feature9 to describe complex tonal contrasts in other languages. This approach, by permitting the surface specifications +, -, and 0 for a single tone feature, enables one to express three contrasting pitch heights. Obviously, however, this is insufficient for Hausa, which requires four phonological pitch heights, as shown in (27). (27)
    a.  H        b.  L        c.  H        d.  L        Primary tier
        |            |            |            |
        o            o            o            o        tonal node
        |            |            |            |
        H            H            L            L        Register tier
Another logically possible route would be to employ an n-ary valued feature for tone height, as Stahlke (1975) proposed for Igede.10 However, such a proposal is not well equipped to deal with the phonological regularities that we have documented in this paper. If we translated the four tone heights in (27) into a four-valued system, we would presumably represent (a) as 1, (b) as 2, and so forth. In a system of this sort, downdrift could not be expressed phonologically; yet we have argued that construing downdrift as a phonetic rule fails to explain the complementarity between downdrift and register raising. What we have done is to show that the framework of Hyman (1986) and Inkelas (1987a), which was developed for the description of lexical tonal contrasts in systems with four phonological tone heights, can express the phonology of intonation in what is essentially a two-tone language. This is an interesting result, since it predicts that in a four-tone language, the rules of intonation will not create new phonological entities; instead, we would expect them either to have a neutralizing effect (as Inkelas 1987a suggests for Chaga) or to be confined to the phonetic component, where they would not be expected to exhibit the sorts of discrete, categorial behavior that we observe in Hausa.
Notes

We are grateful to Mary Beckman, Bruce Hayes, John Kingston, Janet Pierrehumbert and Bill Poser for their comments and suggestions, and to our commentators at the Laboratory Phonology conference, Nick Clements and David Odden. We owe a special debt to Abdullahi Bature, Habib Daba and Lawan Danladi Yalwa for helping us to gather the data and for providing their intuitions about Hausa intonation. None of these people, however, should be held responsible for the content of the paper. The research we report on was supported by a grant from the National Science Foundation, no. BNS86-09977.

1 We measured one point per syllable, taking the peak of High-toned syllables and the valley of Low ones, avoiding areas of perturbation caused by consonants in the Fo contour. Although we have not complicated the diagram by indicating standard error, we computed that statistic for each set of averages and found it to be consistently small, around 5 Hz for each point measured. In no case did the standard error exceed the difference between corresponding points in the two averaged curves shown in (1).
2 Mismatches between prosodic and syntactic constituent structure have been demonstrated for quite a few other languages, and are central to the literature (cited above) on the prosodic hierarchy.
3 In Inkelas et al. (1987) we termed this the "lexical" tier, but we have renamed it "primary," following Hyman (1986).
4 For a similar proposal see Snider (ms.).
5 Clements (1981) offers a slightly different model, representing lowered Highs as the combination of Low and High tones in a single tonal matrix. See also Archangeli and Pulleyblank (1986), Hyman and Pulleyblank (1987).
6 An additional difference is that the final syllable of alaramma has a primary Low in the declarative but a primary High in the question. We have no explanation for this except to note that speakers vary, some producing a final Low as High in interrogatives and others not.
7 We further note that the last one or two High-toned syllables in a question are substantially higher than other Highs in the final phrase. To test for the significance of the suspension of downdrift in questions, we performed a test for linear trend developed by Meddiss (1984), using the ranks of Fo values for each High-toned syllable in the final phrase other than the rightmost High-toned syllable, which undergoes extra raising in questions. We tested declarative and interrogative versions of two different sentences. The test showed that the downward trend across High syllables was significant at the 0.05 level for each declarative, whereas in the interrogatives a level trend proved to be significant (also at the 0.05 level).
8 We performed a t test for the means of difference scores of two related samples. For each pass through the corpus we compared both the absolute rise in Fo from the penultimate to final syllable of the tokens in the kalau and kawaa group, and the ratio of Fo of those syllables. In no case did we find a probability less than 0.05 that the difference between the two samples was due to chance. Therefore, we were able to reject the hypothesis that the kalau and kawaa sentences had a significantly different Fo contour over the last two syllables. That is to say, the factor of whether or not the final syllable belonged to an ideophone did not prove significant.
9 See, for example, Pierrehumbert and Beckman's (1986) account of Japanese intonation.
10 See Clements (1981) and Hyman (1986) for counteranalyses that avoid the n-ary use of features in such cases.

References
Archangeli, D. and D. Pulleyblank. 1986. The content and structure of phonological representations. MS, University of Arizona and University of Southern California.
Beckman, M. and J. Pierrehumbert. 1986. Intonational structure in Japanese and English. Phonology Yearbook 3: 255-309.
Clements, G. N. 1981. The hierarchical representation of tone features. In G. N. Clements (ed.) Harvard Studies in Phonology vol. 2. Reprinted in I. Dihoff (ed.) Current Approaches to African Linguistics vol. 1. Dordrecht: Foris, 145-176.
Clements, G. N. 1985. On the geometry of phonological features. Phonology Yearbook 2: 225-252.
Hayes, B. 1984. The prosodic hierarchy and meter. To appear in P. Kiparsky and G. Youmans (eds.) Rhythm and Meter. Orlando: Academic Press.
Hyman, L. 1985. Word domains and downstep in Bamileke-Dschang. Phonology Yearbook 2.
Hyman, L. 1986. The representation of multiple tone heights. In K. Bogers, H. van der Hulst and M. Mous (eds.) The Phonological Representation of Suprasegmentals. Dordrecht: Foris, 109-152.
Hyman, L. and D. Pulleyblank. 1987. On feature copying: parameters of tone rules. To appear in L. Hyman and C. N. Li (eds.) Language, Speech and Mind: Studies in Honor of Victoria A. Fromkin. Croom Helm Ltd.
Inkelas, S. 1987a. Register tone and the phonological representation of downstep. Paper presented at the 18th Conference on African Linguistics, University of Quebec, Montreal.
Inkelas, S. 1987b. Tone feature geometry. Paper presented at NELS 18, University of Toronto.
Inkelas, S., W. Leben, and M. Cobler. 1987. Lexical and phrasal tone in Hausa. Proceedings of NELS 17.
Leben, W. 1978. The representation of tone. In V. Fromkin (ed.) Tone: A Linguistic Survey. Orlando: Academic Press, 177-219.
Leben, W. 1986. Syllable and morpheme tones in Hausa. JOLAN 3.
Leben, W., S. Inkelas, and M. Cobler. To appear. Phrases and phrase tones in Hausa. In P. Newman (ed.) Papers from the 17th Conference on African Linguistics. Dordrecht: Foris.
Liberman, M. and J. Pierrehumbert. 1984. Intonational invariance under changes in pitch range and length. In M. Aronoff and R. Oehrle (eds.) Language Sound Structure: Studies in Phonology Presented to Morris Halle. Cambridge, MA: MIT Press.
Meddiss, R. 1984. Statistics Using Ranks: A Unified Approach. New York: Blackwell.
Nespor, M. and I. Vogel. 1982. Prosodic domains of external sandhi rules. In H. van der Hulst and N. Smith (eds.) The Structure of Phonological Representations, Part I. Dordrecht: Foris, 225-255.
Nespor, M. and I. Vogel. 1986. Prosodic Phonology. Dordrecht: Foris.
Newman, P. 1986. Tone and affixation in Hausa. Studies in African Linguistics 17: 249-267.
Newman, P. and R. Newman. 1981. The q morpheme in Hausa. Afrika und Übersee 64: 35-46.
Pierrehumbert, J. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT.
Pierrehumbert, J. and M. Beckman. 1988. Japanese Tone Structure. Linguistic Inquiry Monograph. Cambridge, MA: MIT Press.
Poser, W. 1984. The phonetics and phonology of tone and intonation in Japanese. Ph.D. dissertation, MIT.
Pulleyblank, D. 1986. Tone in Lexical Phonology. Dordrecht: Reidel.
Selkirk, E. 1980. On prosodic structure and its relation to syntactic structure. In T. Fretheim (ed.) Nordic Prosody II. Trondheim: TAPIR.
Selkirk, E. 1984. Phonology and Syntax. Cambridge, MA: MIT Press.
Selkirk, E. 1986. On derived domains in sentence phonology. Phonology Yearbook 3: 371-405.
Snider, K. Towards the representation of tone: a 3-dimensional approach. MS, Department of Linguistics, University of Leiden.
Stahlke, H. 1975. Some problems with binary features for tone. In R. K. Herbert (ed.) Proceedings of the 16th Conference on African Linguistics (Ohio State University Working Papers in Linguistics 20), 87-98.
Yip, M. 1980. The tonal phonology of Chinese. Ph.D. dissertation, MIT.
3 Metrical representation of pitch register D. ROBERT LADD
Introduction

This paper discusses several interlocking problems in intonational phonology, all having to do in one way or another with the scaling of tonal targets in utterance fundamental frequency (Fo) contours. Evidence is presented for two broad claims:
1. that downstep and prominence are intrinsically related phenomena of pitch register;
2. that the phonological control of register makes reference to hierarchical metrical structure of the sort that can be independently motivated by facts about rhythm, focus, and so on.
More generally, the paper attempts to clarify the essential differences between the now virtually standard theory of intonational phonology elaborated by Pierrehumbert and her colleagues (Pierrehumbert 1980, 1981; Liberman and Pierrehumbert 1984; Beckman and Pierrehumbert 1986) and the variant of this standard approach developed in my own work since the appearance of Pierrehumbert's thesis (especially Ladd 1983, 1986, 1987). Because the strictly experimental evidence discussed in the second half of the paper is quite limited - work to support some of the other interrelated claims is still in progress or being planned - the discussion of my overall approach is intended to enable the reader to evaluate the limited evidence presented here in its larger context.
I presuppose without comment Pierrehumbert's basic autosegmental premise that Fo contours in English and many other languages can be analyzed as strings of pitch accents, and that pitch accents are composed of one or more tones. (In addition to pitch accents, there are other elements occurring at the ends of various prosodic domains - Pierrehumbert's "phrase accent" and "boundary tone" - but these will not concern us here.) I also take for granted one of the major descriptive insights of Pierrehumbert (1980), namely that "declination" - overall downward Fo trend across phrases and utterances - can largely be attributed to the repeated application of a rule of downstep at successive pitch accents.1
However, there are a number of differences of detail in the inventory of pitch accents I use; these differences reflect a set of interrelated theoretical differences over the phonological status of downstep and the phonological nature of intonational distinctions. These differences are outlined in the first half of the paper.
1 A model of intonational phonology and phonetics

1.1 Non-gradient partial similarities in intonational phonology
The hallmark of Pierrehumbert's theory is its emphasis on linearity. According to her view, most of the distinctive aspects of fundamental frequency (Fo) contours - including downstep - can be represented phonologically in terms of a single underlying tonal string, which is both generated by a finite state grammar and phonetically realized from left to right. In this, Pierrehumbert's approach is more radically linear than a number of other autosegmental descriptions. These include both descriptions of African downstep that involve a theoretical construct of register - e.g. Clements's (1979) "tone level frame," or the parallel "register tier" of tones in Hyman (1985) and Inkelas and Leben (this volume) - and descriptions that assign a constituent structure to the tonal string - e.g. Clements (1983), Huang (1980) for African languages, and Liberman (1975), Ladd (1986) for the intonation systems of European languages. The model I will present here assumes a notion of register similar to Clements's tone frame together with a hierarchical structure that in effect accomplishes certain functions of the others' "register tier."
Despite its linearity, Pierrehumbert's theory does assign an important role to various orthogonal factors in determining the phonetic details of Fo. I shall be concerned in particular with two aspects of what might loosely be called "pitch range":
1. the relative prominence of individual pitch accents, which is supposed to have a considerable local effect on the scaling of tonal targets;
2. the setting of "reference lines" for larger units of intonational structure such as intonational phrases, which setting expresses differences in overall Fo level between one phrase and another.
Prominence and reference line are phonetic realization parameters that enter into the computation of Fo at any given phonologically specified target. Both are assumed to vary gradiently, i.e. to take on a meaningfully continuous range of values rather than to represent an either/or choice like downstep. In this respect Pierrehumbert's model is entirely traditional: most descriptions of intonation, however explicit they may be in most other respects, include some sort of provision that permits the range of Fo within which a given intonational contour is realised to be expanded and contracted quite unpredictably, both globally and locally. Fujisaki and Hirose (1982: 59), to take another example, provide a
mathematically quite complex characterization of declination across accent peaks of underlyingly equal size, but then simply state that the accent commands "may vary in their height." Unconstrained gradient variability of prominence and other pitch range parameters is, in my view, the most serious empirical weakness of a great many quantitatively explicit models of Fo. This is primarily because it makes hypotheses about intonational phonology and phonetics difficult to falsify: unpredicted aspects of experimental data can always be ascribed to poorly understood differences of local prominence. (Instances where this has been done include Pierrehumbert 1980: ch. 4 and - much more conspicuously - Cooper and Sorensen 1981: study 2.2.3.) The escape-hatch use of gradient prominence is all the more noteworthy because differences of prominence are otherwise largely ignored. For example, Fujisaki and Hirose apparently model contours using only two values of the accent command parameter; they refer without comment (1982: 68) to the "higher" and "lower" levels of accentuation, and simply do not confront the implications of having a parameter that in practice takes only two of a theoretically unlimited continuous range of values.
correlations such as these can often be observed, and I acknowledge that intonation presents special problems in this regard, but I do not believe that gradient realization parameters correctly express the workings of this kind of variability. The model presented in the next two sections develops an alternative view.

1.2 Pitch realization model
This section outlines a pitch realization model based on the theoretical ideas just summarized, which has been implemented in the text-to-speech system of the Centre for Speech Technology Research in Edinburgh. (It is referred to hereafter as the CSTR model.) This model owes obvious debts to those of Pierrehumbert, Fujisaki (e.g. Fujisaki and Hirose 1982), and Gårding (e.g. 1983, 1987). Its principal innovation is the way it aims to constrain the analyst's recourse to gradient "pitch range" parameters in modeling Fo contours. The model defines a speaking range for any given stretch of speech. Subject to a number of minor qualifications, this range is idealized as constant for a given speaker on a given occasion. It does not vary for phrase, sentence, and paragraph boundaries, or for different degrees of local prominence or emphasis, these all being treated as functions of register rather than range. Range can, however, vary for truly paralinguistic differences of overall interest, arousal, etc., as in Liberman and Pierrehumbert (1984) or Bruce (1982). Mathematically, range is defined in terms of two speaker-specific parameters, a baseline and a default initial register setting. Loosely speaking, we can think of this as setting a bottom and a top. Further details are given in the appendix. Within the range, a frequently-changing register is defined. The term register is used more or less as in other current work (e.g. Clements 1979; Hyman 1985) to refer to a band of Fo values - a subset of the full range - relative to which local tonal targets are fixed. For English, the top and bottom of the register are the local default values of H and L tones respectively, while the middle of the register is the local default value for a neutral pitch from which Fo movements to H or L begin, and to which Fo tends to return after H or L. As I just noted, the Fo correlates of phrasing and prominence are modeled in terms of register, in one of two ways. First, the register changes frequently, thereby changing the local values of tones. Second, prenuclear accents can be scaled "inside" the register, i.e. prenuclear Hs may not be as high as the default H and prenuclear Ls not as low as the default L.2 The basic constructs of the model are illustrated in figure 3.1. In certain respects, register is comparable to the "grid" in Gårding's model. However, there are three important differences, all of which make the CSTR model more constrained than Gårding's. Gårding's grid can take on an implausibly large variety of shapes, slopes and widths; indeed, it seems capable of modeling virtually any contour, occurring or non-occurring. The approach taken here substantially restricts the model's possible outputs.
[Figure 3.1 Basic concepts of the CSTR pitch realization model. Elements identified by letters are: (a) prenuclear accent H; (b) Fo transition; (c) nuclear accent HL; (d) low final boundary tone.]
The three restrictions are the following:
1. Register in the CSTR model can change only in highly constrained ways, stepping up or down by fixed amounts at certain well-defined points. (The mathematical characterization of register steps is based on Pierrehumbert's downstep model and is identical to the independently developed formula presented in Clements's commentary, this volume; more detail is given in the appendix.) This means that, compared to Gårding's grid, the "grid" generated by a sequence of register steps can have relatively few distinct shapes and slopes.
2. Gradient differences of prominence are assumed to affect only prenuclear accents in the CSTR model, reducing them relative to the nuclear accent. Nuclear accents are as it were maximally prominent and are scaled at the top or bottom of the register. (Among other things, this provision makes it impossible to scale a downstepped accent higher than the preceding accent relative to which it is downstepped, which seems theoretically possible in Pierrehumbert's model.) In Gårding's model, tones can be scaled both outside the grid (for greater prominence) and inside the grid (for lesser prominence). In the absence of a theory of how prominence is determined, this possibility seems to undermine the empirical basis of the grid itself.
3. The definition of register in the CSTR model makes crucial reference to target level, whereas Gårding's grid emphasises contour shape. Specifically, the CSTR model presupposes some comparability of pitch levels across sentences, based on the hypothesis that the highest high in a sentence has a default level that is a reasonably constant speaker characteristic. (This is fixed by the default initial register setting, which is one of the two parameters that define the speaker-specific range.) Gårding's grid, by contrast, would be free to start anywhere, and initial accent scaling in both Fujisaki's and Pierrehumbert's models is directly affected by the gradient variability of phrase-level realization parameters.3
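To make the range/register distinction concrete, here is a small numerical sketch. The exponential form of a register step (decaying toward the speaker's baseline) follows the Pierrehumbert-style downstep formula the text alludes to, but all constants, the placement of the register bottom, and the function names are illustrative assumptions rather than the CSTR model's actual parameter values (which are given in the paper's appendix).

    BASELINE = 80.0        # speaker-specific bottom of the range (Hz) - hypothetical
    INITIAL_TOP = 160.0    # default initial register setting (Hz) - hypothetical
    STEP = 0.6             # fraction of the baseline distance kept at each register step
    BAND = 0.5             # assumed: register bottom sits this fraction of the way up to the top

    def register(n_steps):
        """Return (bottom, top) of the register after n downward register steps."""
        top = BASELINE + (INITIAL_TOP - BASELINE) * (STEP ** n_steps)
        bottom = BASELINE + BAND * (top - BASELINE)
        return bottom, top

    def target(tone, n_steps, prominence=1.0):
        """Scale an H or L accent tone within the current register.

        Nuclear accents (prominence=1.0) sit at the top/bottom of the register;
        prenuclear accents may be scaled 'inside' it with prominence < 1.0.
        """
        bottom, top = register(n_steps)
        mid = (bottom + top) / 2.0
        if tone == "H":
            return mid + prominence * (top - mid)
        return mid - prominence * (mid - bottom)

    # Figure 3.2a, schematically: reduced prenuclear H ('made'), nuclear H ('airport'),
    # then a second phrase one register step down with a fully prominent prenuclear H.
    print(target("H", 0, prominence=0.7))   # made
    print(target("H", 0))                   # airport
    print(target("H", 1))                   # flight / cancelled, one step down

Because range (BASELINE, INITIAL_TOP) is held constant and only the register moves, the sketch mirrors the restriction that phrasing and prominence are functions of register rather than of a freely variable pitch range.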
Various features of the CSTR model are illustrated with the hypothetical contours in figure 3.2. Figure 3.2a shows how prominence differences may be signalled by reducing prenuclear accents relative to the following nuclear accent in the same register. In made it to the airport, airport is scaled higher than made because the prominence on made is reduced, not because the prominence on airport is increased. In the flight was cancelled, flight is prenuclear but fully prominent, and is therefore scaled the same as cancelled. Figure 3.2a also shows how the various distinctions under consideration here may be, so to speak, nested: the reduction of made relative to airport, and the nonreduction of flight relative to cancelled, both take place within the larger framework of the register downstep from the first phrase to the second. Figure 3.2b shows the same nesting, but this time with the individual phrases downstepping as well: there is downstep at each accent in the two phrases When Harry arrived at the airport and he was arrested immediately, and in addition the second phrase is downstepped relative to the first.
[Figure 3.2 Models of idealized Fo data for two sentences: (a) I made it to the airport, but the flight was cancelled (tones H HL H HL L%); (b) When Harry arrived at the airport, he was arrested immediately (tones H H HL H HL L%).]
The fully prominent scaling of all the accents - with each accent peak at the top of the newly changed register - seems to be characteristic of downstepping tone groups. Comparison between figures 3.2a and 3.2b shows the comparability of register across sentences: the highest accent in 3.2a (airport) is scaled the same as that in 3.2b (Harry), despite their different position and different prominence relations within their respective sentences. Similarly, comparison of the highest accents in the second half of each sentence shows that they both start one register step down from the highest accent of the whole sentence. In 3.2b, where the first half of the sentence is downstepping, this leads to a "resetting of declination" in the output Fo contour; 3.2a is superficially quite different, since Fo steps down rather than up across the phrase boundary, but this arises from the nesting of phrase characteristics within the higher-level downstep from the first phrase to the second.

1.3 Metrical control of register
A quantitatively explicit characterization of register is only half the story. To give a complete account of downstep and prominence we must also state the phonological basis of register distinctions, the linguistic factors that control when and by how much register changes. This is closely linked to the problem of describing those intonational distinctions that I described above as non-gradient but orthogonal to the distinctions of the tonal string. My first attempt to formalize these distinctions was to attach "features" to the tones in the string, such as [downstep] or [raised peak] (Ladd 1983). In the case of downstep, the feature description makes it possible to express the intuition that a string of high accents has some sort of phonological identity (distinct from e.g. a string of low-rising accents) regardless of whether the general trend across the accent peaks is downward or fairly level. That is: (1)
    a. I really don't think he ought to be doing that.
       H        H        H        HL       L%
                [d.s.]   [d.s.]   [d.s.]

    is more similar, both formally and functionally, to

    b. I really don't think he ought to be doing that.
       H        H        H        HL       L%

    than either is to

    c. I really don't think he ought to be doing that.
       L        L        L        HL       L%
An analysis with a separate register tier would be comparable; for these purposes it obviously makes little difference whether we write

    H   H   H   HL
        |   |   |
        L   L   L

or

    H   H        H        HL
        [d.s.]   [d.s.]   [d.s.]

The point of either analysis is that downstep involves a categorical phonological distinction, but that, like the supposedly gradient differences of prominence, it is a distinction that somehow does not obscure the basic identity of the tonal string.
However, there is an inherent problem with the use of features to describe downstep. As Beckman and Pierrehumbert (1986: 279 ff.) have pointed out, the use of a downstep feature makes it possible to represent nonoccurring - even nonsensical - strings of accents. Consider a string of pitch accents that I would represent as H HL (see figure 3.1). We want to be able to choose between downstepping and not downstepping the second of the two accents, i.e. we want two strings:

    H   HL          and          H   HL
                                     [d.s.]

But there is no way (other than arbitrary stipulation) to prevent a feature [downstep] from being applied to the first pitch accent as well, giving strings like

    H        HL
    [d.s.]   [d.s.]

or even

    H        HL
    [d.s.]

In other words, the use of a downstep feature in no way expresses the inherently relational nature of downstep, the fact that a downstepped accent must be downstepped relative to some earlier accent.
To solve this problem I propose that phonologically downstep is a metrical relationship between intonational constituents. The use of a metrical tree expresses
downstep's relational nature in a way that features or register tones cannot. In the case of the two-accent string, it permits exactly the two-way distinction we want:

        l   h                      h   l
        |   |                      |   |
        H   HL                     H   HL
    (non-downstepped)          (downstepped)
The metrical notation is intrinsically unable to represent anything corresponding to either of the meaningless strings

    H        HL                   H        HL
    [d.s.]   [d.s.]     and       [d.s.]
Yet at the same time it preserves the ability to treat downstep as something outside the system of tonal contrasts. (Note, however, that the phonological symmetry of h-l vs l-h is not matched in the phonetic realization of these structures. This will be discussed shortly.) Moreover, a metrical representation of downstep automatically subsumes differences of overall pitch level between intonational phrases. These are the relationships analyzed by Pierrehumbert in terms of gradiently different specifications of the "reference line," and by Fujisaki as commands of different sizes to the phrase component. The reference to "nesting" downsteps in the previous section, which implicitly involves reference to a metrical structure, has to do with exactly this phenomenon. We might, for example, represent the sequence of register steps in figure 3.2b as follows:

[metrical tree for the register steps in figure 3.2b]
To the extent that register relations between phrases can be shown to behave phonetically like downstep relations within phrases, then a unified metrical representation for both is obviously appropriate. In order to test the model's predictions about the phonetic details of register relations, it is first necessary to provide an account of how to linearize the h-l
metrical trees into a string of commands for shifting the register. This problem is conceptually analogous to the problem of getting from a s-w metrical tree to a linear representation of syllable prominence, which Liberman and Prince (1977) dealt with by means of their Relative Prominence Projection Rule (RPPR) and the notion of the metrical grid. My solution to linearizing metrical trees resembles Liberman and Prince's in several respects. First let us define a notion Highest Terminal Element (HTE), analogous to the DTE of a prominence tree: The HTE of any metrical tree or subtree is that terminal element arrived at by following all the branches labelled h from the root of the tree or subtree.
Given this, we may then formulate a Relative Height Projection Rule (RHPR). This resembles the Liberman-Prince RPPR in spirit, but the detail is more complicated because of the asymmetry between h-l (which produces downstep) and l-h (which does not produce upstep, but simply leaves the register unchanged).

RHPR (first and only approximation): In any metrical tree or constituent, the HTE of the subconstituent dominated by l is:
a. one register step lower than the HTE of the subconstituent dominated by h, when the l subconstituent is on the right;
b. at the same register as the HTE of the subconstituent dominated by h, when the l subconstituent is on the left.
Graphically: in an h-l constituent the HTE of the l branch is realized one register step below the HTE of the h branch, while in an l-h constituent both HTEs are realized at the same register. [Tree diagrams not reproduced.]
I emphasize that this is a first approximation, which will need to be modified in the light of further research. At a minimum, a revised RHPR would need to allow for interaction with local (nonmetrical) effects on register such as final lowering.
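To make the intended reading of the HTE definition and the RHPR concrete, here is a minimal sketch in Python. It is my illustration, not part of Ladd's proposal; the tree encoding, the function names, and the assumption that non-HTE terminals simply inherit the register of their subconstituent are all mine.

    # Illustrative sketch only (not Ladd's own formalization).
    # A tree is either a terminal accent string (e.g. "H" or "HL") or a tuple
    # (label_left, left_subtree, label_right, right_subtree), labels "h"/"l".

    def hte(tree):
        """Highest Terminal Element: follow the h-labelled branches down."""
        if isinstance(tree, str):
            return tree
        left_label, left, right_label, right = tree
        return hte(left) if left_label == "h" else hte(right)

    def register_steps(tree, step=0):
        """One reading of the RHPR: an l-subconstituent on the right is one
        register step lower than its h sister; an l on the left is not."""
        if isinstance(tree, str):
            return [(tree, step)]
        left_label, left, right_label, right = tree
        right_step = step + 1 if right_label == "l" else step
        return register_steps(left, step) + register_steps(right, right_step)

    # The two-accent contrast discussed above:
    downstepped = ("h", "H", "l", "HL")      # h-l: HL is downstepped
    plateau = ("l", "H", "h", "HL")          # l-h: register unchanged
    print(hte(downstepped), register_steps(downstepped))  # H [('H', 0), ('HL', 1)]
    print(hte(plateau), register_steps(plateau))          # HL [('H', 0), ('HL', 0)]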
2 Experimental evidence

There are obvious similarities between the metrical account of register here and the metrical representation proposed by Clements and others (Huang 1980, Clements 1983) to describe downstep in the terrace-level tone languages of Africa. The basic idea of those proposals is to group together strings of tones spoken on the same tone-terrace into "tonal feet" and to organize the tonal feet into a right-branching structure with sister nodes labeled h and l. The structures proposed here are richer, in that they allow left-branching as well as right-branching, and sister nodes labeled l-h as well as h-l. Moreover, the structures posited here
represent an independent phonological choice in the intonational system, whereas the metrical structures proposed for African downstep are an aspect of phonetic realization, derived by rule from the tonal string. However, it is doubtful that these constitute essential differences between the two proposals; it strikes me as far more likely that they reflect differences between the African and European languages. The real question is whether a metrical structure is needed at all. In discussing the case of English, Pierrehumbert (1980) argues strongly against metrical representation of downstep, primarily on the grounds that metrical formalisms are needlessly powerful (in particular, they make it possible to represent nonlocal dependencies). In the Africanist literature, the metrical proposals appear to have been supplanted by the register tier descriptions mentioned earlier. Since the metrical proposals have apparently found little favor, it is appropriate to provide evidence that their extra power is actually needed in the description of pitch contours. This half of the paper briefly discusses three experiments that bear on this general issue.

2.1 Differences of phrase-initial resetting
It is not difficult to observe sentences in which the appropriate setting of F o following a sentence-medial boundary seems to depend on the sentence's overall syntactic structure — specifically on the attachment in surface structure of the phrase that begins at the boundary. This is illustrated in the following pair of examples: (2)
a. There was a new-looking oilcloth of a checked pattern on the table, and a crockery teapot, large for a household of only one person, stood on the bright stove.
b. There was a new-looking oilcloth of a checked pattern on the table, and a crockery teapot, large for a household of only one person, standing on the bright stove.
The appropriate reset on crockery teapot appears greater in (2a), where the comma pause occurs at a boundary between two sentences (main clauses) conjoined at the top of the syntactic tree, than in (2b), where the comma pause occurs at a boundary between two noun phrases within the single main clause. This is prima facie evidence for some sort of nonlocal dependency in intonation, of the sort that could appropriately be described by metrical structures like those proposed in the preceding section. In order to study such dependencies more systematically, I did an experiment in which I measured F0 values in sentences such as the following:5 (3)
a. Allen is a stronger campaigner, and Ryan has more popular policies, but Warren has a lot more money. (and/but structure)
b. Ryan has a lot more money, but Warren is a stronger campaigner, and Allen has more popular policies. (but/and structure)
Figure 3.3 Data for one speaker from experiment 1. Accents are numbered consecutively 1-3 within each clause A-C. Each data point represents the mean of 17 or 18 measurements. [Plot not reproduced; the two lines plot the but/and and and/but conditions across accents A1-C3.]
The most natural interpretation of these sentences appears to be one in which the but opposes one proposition to the two propositions conjoined by and: (4)
[A and B] but [C]
[A] but [B and C]
Translating this bracketing into tree notation, it can be determined that the but-boundaries involve a higher attachment than the and-boundaries. In the terms proposed by Cooper and Paccia-Cooper (1980), the but-boundaries are stronger than the and-boundaries. The sentence pairs thus isolate the factor of hierarchical attachment or boundary strength from potential confounding variables such as differences of length and semantic content. For the experiment, I constructed 18 different versions of each of the two bracketing structures for the sentences just illustrated, systematically mixing the subjects and predicates to control for order and repetition effects and for segmental effects on F0. Recordings of the 36 sentences were made and analyzed for four speakers of British English, with mean F0 values determined for the three accent peaks in each clause and for the low point at the end of each clause. Results for one speaker are shown graphically in figure 3.3; for more details see Ladd (1988). Briefly, the post-boundary accent peaks (B1 and C1) generally show an effect of hierarchical structure, being scaled higher after but than after and. Given the kind of metrical representation proposed above, the obvious way to represent these differences is as shown in figure 3.4. However, it is not clear how
Figure 3.4 Possible metrical trees representing the two structures in experiment 1. 4a: and/but; 4b: but/and. [Tree diagrams not reproduced.]
to get from this representation to the pattern of target scaling in figure 3.3. On the basis of the trees in figure 3.4 the preliminary RHPR given above correctly predicts (1) that peak A1 is scaled the same in both sentences, since it is the sentence's HTE; (2) that the clause-initial peaks in the but/and condition show a steady downstepping trend; and (3) that in the and/but condition B1 and C1 are about the same, since both are scaled relative to A1. But it does not predict (a) the difference of scaling on B1 between conditions, nor (b) the apparent lack of dependence of the clause-final peaks (A3, B3, C3) on their respective clause-initial peaks. Further refinements are clearly needed, possibly including a rule of "initial raising" analogous to final lowering.

2.2 Cross-sentence comparability of level in nested metrical relationships
In an experiment which firmly established the relevance of F0 targets in the control of intonation, Liberman and Pierrehumbert (1984; see also Pierrehumbert 1980) measured the relative height of the two accent peaks on the utterance Anna came with Manny under various conditions. The utterance was spoken with two intonations, making it appropriate as an answer to the questions "What about Anna, who did she come with?" and "What about Manny, who came with him?" In addition, the utterance was spoken in various pitch ranges or "degrees of overall emphasis." Plotting the results for four different speakers showed that for both intonation patterns the relationship between the two peaks as the range increased approximated a straight line, i.e. there is a constant relationship between the two peaks.
In accordance with the notion that such relationships can be nested, as suggested in the discussion of figure 3.2, I did another experiment in which the Anna/Manny sentences were varied by giving the characters family names. I did not manipulate pitch range, as I was interested in comparing peaks across conditions in the same range - i.e. I wanted to see how the peaks on Anna Lowry would compare to those on Anna alone. I hypothesized that the relationship holding between Anna and Manny in the original experiment would hold between their surnames in the follow-up experiment, and that a separate relationship (presumably expressing relative prominence) would simultaneously hold between first name and family name in both halves of the sentence. That is, I made two related predictions: (1) that in the phrase Anna Lowry, given that Lowry is more prominent than Anna, speakers would scale the accent on Anna lower than it would have been scaled in the version of the sentence where Anna occurred alone; (2) that Anna and Manny in a first-name-only version of the sentence would be scaled the same as the two family names in a first-name-plus-family-name version. Because of the relative rarity of the fall/fall-rise intonation on a single clause in British English, I changed the text and used sentences like the following: (5)
a. It wasn't Anna Myers, it was Alan Lowry. (fall-rise/fall)
b. It was Anna Myers, not Alan Lowry. (fall/fall-rise)
There were 96 such sentences in all, with combinations of two first names (Anna, Alan), and three family names (Myers, Lowry, Wylie) in four conditions with differing structures (family name only; first name and family name; first name and family name with contrast on first name and therefore deaccented family name [It was Anna Myers, not Alan Myers]; and first name and family name with contrast on family name [It was Anna Myers, not Anna Lowry]). In other words, for each of the two intonation types, there were two conditions that were comparable to Liberman and Pierrehumbert's sentence in that they contained only two accent peaks, and two that were different in that they contained four. The most important prediction was that the scaling of the second and fourth peaks in the four-accent sentences would be the same as that of the first and second, respectively, in the two-accent ones. Figure 3.5 shows the results for three conditions from two speakers, which by and large support that prediction. They seem to demonstrate that, if there is a close constituent relationship between the first accent of a phrase and the second, and if the second is not downstepped, then the second accent comes out where it would if it were the only accent in its phrase. The scaling of the first of the two accents does not, as in many models, somehow "initialize" the range setting for a phrase, but simply represents a choice of prenuclear prominence within the default initial register. The accent can be more or less prominent - can be scaled inside the register or not - without in any way affecting the register itself and therefore without affecting the scaling of the second accent. This is seen most clearly in the difference between the two four-accent cases: sentences in which the
Figure 3.5 Data for two speakers from experiment 2, sentences of the form It was X, not Y. Each data point represents the mean of 12 measurements. The condition in which X and Y are first name + family name, with contrast on the first name and a (deaccented) repeated family name, has been omitted because of complications beyond the scope of this paper. [Plots not reproduced; the conditions shown are repeated first name, first name only, and first name + family name.]
first name was repeated (It was Anna Myers, not Anna Lowry) had a lower first accent peak than those in which both the first name and the family name contrasted (It was Anna Myers, not Alan Wylie), but the scaling of the accent peak on the surname was unaffected.
A similar experiment was performed independently by Pierrehumbert and Liberman at about the same time as the one just outlined, with comparable results. (It too has remained unpublished.) The main difference of method was that Pierrehumbert and Liberman, as in the Anna/Manny experiment, had speakers produce the test sentences in different pitch ranges. They concentrated on comparing F0 target ratios across pitch ranges, rather than (as in the present experiment) comparing target levels across conditions. Nevertheless, their results, like mine, clearly demonstrate that register relationships can be nested. Their interpretation of this finding involves two distinct types of relations, namely prominence relations between accents within an intonational phrase and relations of overall level ("reference line") between intonational phrases within an utterance. My interpretation in terms of a single tree structure is in keeping with the notion of recursive prosodic phrasing outlined in Ladd (1986). Further discussion is beyond the scope of this paper, but more details on the experiment will be presented in Ladd (forthcoming b).

2.3 Effects of branching structure on scaling of HTEs
The two experiments just described provide a certain amount of evidence that the scaling of the HTE of a sentence is comparable across utterances by the same speaker. Yet there is also evidence suggesting that a sentence-initial accent peak in a downstepping contour — the top of the declination slope, as it were — increases in F o as the sentence increases in length. Cooper and Sorensen (1981), among others, have concluded on the basis of experimental results that F o toplines start sharply higher if the phrase is longer; such a relationship is also assumed in the quantitative model of declination proposed by 't Hart (1979). What is of interest here is that this relationship has not been found consistently. Sternberg et al. (1980) found no effect of increased length, while Thorsen (1980, 1981) and Liberman and Pierrehumbert (1984) report only very small increases in F o accompanying increases in sentence length. It seems likely that metrical structure is a relevant factor in these contradictory results. Liberman and Pierrehumbert expanded their test sentences simply by appending further material on the right and leaving the metrical structure at the beginning of the phrase unaffected. Cooper and Sorensen, on the other hand, expanded most of theirs by lengthening several constituents (e.g. The deer could be seen from the car vs. The deer in the canyon could be seen from the window of the car).
Consideration of a variety of data led me to hypothesize that, despite the substantial constancy of sentence HTE, there is at least one metrical configuration in which the HTE is higher than normal, namely one in which the left branch under the root R itself branches, dominating an initial H H(L) sequence, with the rest of the phrase (X ...) on the right. [Tree diagram not reproduced.]
More concretely, if the beginning of a downstepping sentence is right-branching, then the first peak will be scaled approximately the same, regardless of the length of the phrase.6 If, however, the first accent is not immediately dominated by the top of the tree - if, in effect, the left branch branches - then the first accent will be scaled significantly higher. Sentences that illustrate this prediction are:
(6) a. The smallest were made into jelly
    b. The smallest were eventually made into jelly
    c. The smallest plums were made into jelly
[Bracketing diagrams not reproduced.]
Sentences (6a) and (6b), being both right-branching, should have their first accent peaks scaled at the same height, while (6c), with a branching left branch, should start higher. These predictions were tested in an experiment carried out by Catherine Johnson, a former undergraduate in the Linguistics Department at Edinburgh, as the basis of her Honors dissertation (Ladd and Johnson 1987). She had two speakers read sentences of the three branching types illustrated in (6). In addition, as a further check on possible confounding effects, the sentences were of three different syntactic types: subject/predicate, adverb/main clause, and main verb/infinitive or clause complement. For example: (7)
a. The smallest were made into jelly.
b. In August they harvest the barley.
c. They persuaded them that the climate was favorable.
The prediction was (1) that branching structure would have a significant effect on the height of the first peak, but that syntactic type would not; (2) that the "initial left-branching" structure would have a significantly higher first accent peak than either of the two right-branching ones, which would differ between themselves only slightly or not at all. Up to a point these predictions were confirmed (see figure 3.6). In particular it was demonstrated that branching structure has a significant effect on the height of the first peak. However, the effect was as predicted for only one of the two speakers; for the other speaker the three branching conditions were about equally
Figure 3.6 Data for both speakers from experiment 3. The three accents plotted are the first (P-1), penultimate (P-nult), and last (P-last); i.e. the second accent of the two four-accent conditions is not shown. The three conditions are labeled 3RB (three accent, all right branching), 4RB (four accent, all right branching), and 4ILB (four accent, initial left branching). Average values across the three syntactic types, i.e. each data point represents the mean of approximately 28 measurements. [Plots not reproduced.]
different in their scaling of the first peak. Also, one of the two speakers showed some effect of syntactic type on the first peak.7 More work needs to be done to unravel the effects of these various factors. In any case it is clear that metrical structure does play a significant role in the scaling of high peaks. To be sure, the experimental result does not follow from the preliminary RHPR as stated above, and in fact points to the need for further refinements in that rule. At the same time, the metrical analysis and the RHPR do provide a framework for describing such results. This cannot be said of the traditional account based on gradient differences of prominence without any cross-sentence comparability. For one thing, the accent on smallest in the initial left-branching condition should be less prominent (and hence have lower F0) than in the other two conditions. More generally, the finding reported here lends plausibility to the assumption that what seems to be gradient variability of pitch range can eventually be shown to result from the interaction of a variety of mostly nongradient factors.8
Conclusion

The experiments discussed in the second half of the paper can hardly be said to validate the model presented in the first half. They may, in fact, do little more than provide some basis for various hunches that the model then makes explicit. However, they do seem to support the model's premises in at least two ways. First, it seems clear that some form of metrical representation - possibly less florid than a complete metrical tree - is required as part of the input to a descriptively adequate phonology of intonation. Second and more generally, since the effects attributable to this metrical structure are significant but small, the experiments also suggest that any theorist of intonation who simply accepts the gradient variability of phonetic realization parameters as an important ingredient of target scaling is likely to overlook much that is of interest to the general problem of modeling F0.
Appendix

The basic form of the model for computing F0 targets is:

log F0 = log Fmin × f(N) × f(T)

where Fmin is the speaker baseline, f(N) is the current register setting, and f(T) is the current tonal specification. The similarity of this equation to the general form (but not the detail) of Fujisaki's model will be apparent. Any given register setting is defined by the equation:

f(N) = N · d^i
where N is the speaker-specific range parameter that specifies the default initial register setting (the parameter is called N because it is intended to normalize many range-related differences between speakers). The factor by which register steps down is symbolized by d (where 0 < d < 1; the value used for synthesis at CSTR is 0.8). Successive downsteps are modeled as successive multiplications by the downstep factor, i.e. d^i, where i is a positive integer that increments by 1 at each downstep. (Rules like final lowering might require changes in i to non-integer values, but this is a matter for further research.) For the default initial register setting, i = 0, which entails f(N) = N. The idea of successive multiplication by the downstep factor is adopted from Pierrehumbert (1980). The difference between Pierrehumbert's model and the one outlined here is that the present approach makes it possible to characterize the register setting at any given point in the contour, independently of actual F0 values, and - in effect - without any reference to how the register got to be where it is. (For example, N × d^3.2 would be interpretable, and might even be necessary as a phonetic model of a given set of F0 data even though the phonological rules that would give rise to i = 3.2 remained obscure.) It is the ability to provide an independent characterization of register that was the motivation for Clements's tone level frame (1979); Clements's current mathematical formulation of downstep (this volume) is, as noted above, identical to the one presented here. The top of the register (i.e. H tone) at any given point is f(N) × w, where w is a parameter determining the width of the register. The bottom of the register (i.e. L tone) is f(N) / w. (The value of w used for synthesis at CSTR is 1.5; it will probably be necessary to recognize that w is both language and speaker specific.) More generally, any given tonal specification is defined by the equation:

f(T) = w^(T·p)
where T = +1 for H, -1 for L, and 0 for the middle of the register. The value of p (for prominence) must lie between 0 and 1. If the difference between reduced and nonreduced prenuclear prominence turns out to be a categorical rather than a gradient difference (cf. note 3), then p would have only two values, 1 and some fraction.
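For concreteness, here is a small sketch (mine, not part of the paper) of the register and tone factors just defined, using the CSTR values d = 0.8 and w = 1.5 quoted above; the value chosen for N and the loop are purely illustrative.

    # Illustrative sketch of the appendix's register and tone factors.
    D = 0.8    # downstep factor d quoted in the text (0 < d < 1)
    W = 1.5    # register width w quoted in the text
    N = 1.3    # speaker range parameter N: an assumed, purely illustrative value

    def f_N(i):
        """Register setting after i downsteps: f(N) = N * d**i (i = 0 initially)."""
        return N * D ** i

    def f_T(T, p=1.0):
        """Tonal specification f(T) = w**(T*p); T = +1 (H), -1 (L), 0 (mid)."""
        return W ** (T * p)

    for i in range(4):
        top, bottom = f_N(i) * W, f_N(i) / W   # H and L levels of the register
        print(f"i={i}: f(N)={f_N(i):.3f}  H-level={top:.3f}  L-level={bottom:.3f}")
    # Each downstep multiplies the register setting by d, so successive
    # steps are smaller than the preceding one in absolute terms.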
Notes

The basic metrical analysis outlined here was first presented to the Autumn meeting of the Linguistics Association of Great Britain in Edinburgh in September 1986. I acknowledge the help of my colleagues at the Centre for Speech Technology Research, especially Alex Monaghan and Mike McAllister, in the development and implementation of the intonational synthesis model described here. I am especially grateful to Mary Beckman and John Kingston for their comments on the first version of this paper - comments that went well beyond the call of editorial duty and led to substantial improvements in the analysis. Responsibility for the content of the paper is of course mine alone.
1 Whether some small amount of observed declination should be accounted for by some essentially physiological or otherwise non-phonological process - "true declination," as it were, distinct from downstep - I will leave an open question, though the evidence in favor of such a distinction is quite strong. (See especially the work on Japanese by Poser 1984; Pierrehumbert and Beckman 1988; Kubozono 1988.) In terms of the model presented below, true declination could best be modeled in terms of a gradual decline in the parameter Fmin.
I assume that English downstep, downstep and downdrift in various African languages, and "catathesis" (Poser 1984) in Japanese are all fundamentally equivalent, and can all be treated in terms of register steps in the phonetic realization model presented in part 1, section 2. I thus reject the need for Poser's neologism (pace Beckman and Pierrehumbert 1986: note 9, who defend the terminological proliferation on the grounds that the phenomena may not be equivalent).
2 This represents a modification of the model published in Ladd (1987), where an actual upstep was used to model the relation between a less prominent prenuclear H and the following nuclear H.
3 In order that the claim of cross-sentence comparability of level not be taken as patently false, it is important to emphasize two things: 1. It refers to a default highest level. In any given sentence this is subject to various modifications, which it is beyond the scope of this paper to describe (but cf. part 2, section 3 below). 2. It presupposes the same setting of overall range, unaffected by emotional or other paralinguistic factors (but note that I define these much more narrowly than many investigators; cf. the model's definition of range above).
Experimental data that are at least consistent with the general notion of a default highest high are presented in the second half of the paper.
4 In the terminology introduced in Ladd (1983), made it to the airport would involve a "raised peak" and the flight was cancelled would not. While this analysis is rejected here, it is worth pointing out that it was an attempt to express ideas that are incorporated into the present description: (1) that "raised peak" is not simply gradiently increased prominence on the nucleus; and (2) that "raised peak" and "downstep" are somehow intrinsically opposed to one another. To be sure, it is possible that the difference between reduction and non-reduction of prenuclear accents should also be seen as a categorical rather than a gradient difference, in which case the "raised peak" analysis would find further justification. This question is left for further research (cf. appendix).
5 These test sentences are superficially quite unlike the teapot sentences, but for experimental purposes it was important to keep the sentences as similar as possible while varying the bracketing structure. Test sentences more like the teapot sentences would have involved other differences accompanying the bracketing differences, and would thus have more easily permitted alternative explanations for the experimental results.
6 Any slight tendency to increase initial F0 with increasing length I would analyze as an increase in one of the two range parameters Fmin and N (see appendix). If Fmin is used to model "true declination" as suggested in note 1, then it would probably be appropriate to use it for this effect as well.
7 This appeared to be due to higher initial peaks in the adverb/main clause sentence than in the other two types.
8 In this connection it is worth mentioning that the experimental work carried out by Haruo Kubozono for his Edinburgh Ph.D. thesis (1988) shows pervasive influences of branching structure on the patterns of F0 target scaling in Japanese noun phrases. It does not appear possible to interpret Kubozono's results as involving the application or non-application of Poser's "catathesis" rule, since the effects are too finely matched to the differences of metrical structure - that is, they do not appear to reflect simple presence or absence of a prosodic domain boundary that would block catathesis. Discussion of the details is beyond the scope of this outline.
References

Beckman, Mary E. and Janet B. Pierrehumbert. 1986. Intonational structure in Japanese and English. Phonology Yearbook 3: 255-309.
Bolinger, Dwight. 1961. Generality, Gradience, and the All-or-None. The Hague: Mouton.
1986. Intonation and its Parts. Stanford: Stanford University Press.
Brown, Gillian, Karen Currie and Joanne Kenworthy. 1980. Questions of Intonation. London: Croom Helm.
Bruce, Gosta. 1982. Developing the Swedish intonation model. Working Papers, Department of Linguistics, University of Lund, 22: 51-116.
Clements, G. N. 1979. The description of terraced-level tone languages. Language 55: 536-558.
1983. The hierarchical representation of tone features. In I. Dihoff (ed.) Current Approaches to African Linguistics, Vol. 1. Dordrecht: Foris.
Cooper, William and Jeanne Paccia-Cooper. 1980. Syntax and Speech. Cambridge, MA: Harvard University Press.
Cooper, William and John Sorensen. 1981. Fundamental Frequency in Sentence Production. Heidelberg: Springer-Verlag.
Fujisaki, Hiroya and Keikichi Hirose. 1982. Modelling the dynamic characteristics of voice fundamental frequency with applications to analysis and synthesis of intonation. In Preprints of Papers, Working Group on Intonation, 13th International Congress of Linguists, Tokyo, 57-70.
Garding, Eva. 1983. A generative model of intonation. In A. Cutler and D. R. Ladd (eds.) Prosody: Models and Measurements. Heidelberg: Springer-Verlag, 11-25.
1987. Speech act and tonal pattern in Standard Chinese: constancy and variation. Phonetica 44: 13-29.
't Hart, J. 1979. Naar automatisch genereeren van toonhoogte-contouren voor tamelijk lange stukken spraak. Eindhoven: IPO Technical Report no. 353.
Hirschberg, Julia and Janet B. Pierrehumbert. 1986. Intonational structuring of discourse. Proceedings of the 24th Meeting of the Association for Computational Linguistics, New York, 136-144.
Huang, C.-T. James. 1980. The metrical structure of terraced-level tones. In J. Jensen (ed.) NELS 11 (Cahiers Linguistiques d'Ottawa vol. 9), Department of Linguistics, University of Ottawa, 257-270.
Hyman, Larry M. 1985. Word domains and downstep in Bamileke-Dschang. Phonology Yearbook 2: 45-82.
Kubozono, Haruo. 1988. The organisation of Japanese prosody. Ph.D. thesis, University of Edinburgh.
Ladd, D. R. 1983. Phonological features of intonational peaks. Language 59: 721-759.
1986. Intonational phrasing: the case for recursive prosodic structure. Phonology Yearbook 3: 311-340.
1987. A phonological model of intonation for use in speech synthesis by rule. In Proceedings of the European Conference on Speech Technology, Edinburgh, 2: 21-24.
1988. Declination "reset" and the hierarchical organization of utterances. Journal of the Acoustical Society of America 84: 530-544.
(forthcoming). A study of the scaling of certain contrastive accent peaks.
Ladd, D. R. and Catherine Johnson. 1987. "Metrical" factors in the scaling of sentence-initial accent peaks. Phonetica 44: 238-245.
Liberman, Mark. 1975. The intonational system of English. Ph.D. dissertation, MIT.
Liberman, Mark and Janet Pierrehumbert. 1984. Intonational invariance under changes in pitch range and length. In M. Aronoff and R. Oehrle (eds.) Language Sound Structure. Cambridge, MA: MIT Press, 157-233.
Liberman, Mark and Alan Prince. 1977. On stress and linguistic rhythm. Linguistic Inquiry 8: 249-336.
Pierrehumbert, J. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT.
1981. Synthesizing intonation. Journal of the Acoustical Society of America 70: 985-995.
Pierrehumbert, Janet and Mary Beckman. 1988. Japanese Tone Structure. Cambridge, MA: MIT Press.
Poser, William J. 1984. The phonetics and phonology of tone and intonation in Japanese. Ph.D. dissertation, MIT.
Sternberg, Saul, C. E. Wright, R. L. Knoll and S. Monsell. 1980. Motor programs in rapid speech: additional evidence. In R. A. Cole (ed.) Perception and Production of Fluent Speech. Hillsdale, NJ: Lawrence Erlbaum Associates, 507-534.
Thorsen, Nina. 1980. Intonation contours and stress group patterns in declarative sentences of varying length in ASC Danish. Annual Report, Institute of Phonetics, University of Copenhagen, 14: 1-29.
1981. Intonation contours and stress group patterns in declarative sentences of varying length in ASC Danish - supplementary data. Annual Report, Institute of Phonetics, University of Copenhagen, 15: 13-47.
Williams, C. E. and Kenneth N. Stevens. 1972. Emotion and speech: some acoustical correlates. Journal of the Acoustical Society of America 52: 1238-1250.
4 The status of register in intonation theory: comments on the papers by Ladd and by Inkelas and Leben

G. N. CLEMENTS
The studies by Ladd and by Inkelas and Leben make an important contribution to the vigorous, continuing debate on foundation issues in intonation theory. They are both concerned with establishing the basic formal mechanisms used to describe intonation, understood here (in a narrower sense than the usual one) as the way in which tonal and accentual patterns are mapped into phonetic F o contours. What is central to these proposals is the significance of the role played by register. Both papers view register as a fundamental component of a linguistic description of intonation, and propose to characterize it in terms of separate intonational structures built over the tone melody (a sequence of pitch accents or tones), to which rules assigning F o have access. There are broad areas of agreement in these papers. Their common strategy is to shift some of the burden of accounting for intonational regularities from phonetic implementation rules to phonological structures defining global intonation patterns. Both papers argue that such an approach can capture generalizations that would otherwise go unexpressed, and allows the phonetic implementation rules to be simplified or constrained in desirable ways. To achieve these results Ladd introduces metrical tree structures, while Inkelas and Leben propose to recognize a new tier of autosegmental representation which they term the "register tier." These remarks will focus primarily on Ladd's and Inkelas and Leben's views on how register is to be formally characterized in intonation theory. I will suggest that while we do want a way of expressing register as an independent parameter of pitch assignment rules, we may not require the full power of metrical and autosegmental structures to obtain an adequate account of register shift phenomena. I will offer a brief presentation of an alternative view of register, which retains some of Ladd's and Inkelas and Leben's basic insights but incorporates them into a componential model in which register is one of several interactive parameters of F o assignment.
Before considering these proposals more closely it will be useful to characterize "range" and "register" as these terms are commonly understood. Range refers to the largest interval (or frequency band) within which tones are normally produced in the speech of a given speaker. Register is a smaller interval or frequency band internal to the speaker's range, which determines the highest and lowest frequency within which tones can be realized at any given point in an utterance. Register may be shifted downward under certain phonological or grammatical conditions, giving rise to the effect commonly known as downstep. Crucially, a shift downward in register affects the level at which all subsequent tones in the new register are realized, not just the immediately following tone. These concepts have long been familiar to Africanists, and were well understood in an informal way in work as early as that of Jones and Plaatje (1916). They are introduced in the formal description of tone languages by Clements (1979, 1981), Yip (1980), and Huang (1980), and are proposed for the formal description of pitch-accent systems in the present paper by Ladd. It is important to keep in mind that systems of pitch implementation rules can describe global changes in register without necessarily treating register as a distinct entity in representation. Much of the earlier Africanist literature was concerned with working out models in which downstep was characterized by left-to-right pitch assignment rules, making no crucial reference to register as such (see Clements 1979 for some references), and Pierrehumbert (1980) has proposed rules with a similar effect for English. The larger issue is not whether register is a useful concept in the study of intonation, since there is widespread agreement that it is, but whether it must be expressed in terms of explicit representational structures. Ladd proposes to characterize pitch register in terms of binary-branching trees which are similar to those proposed for African tone languages by Huang (1980) and Clements (1981), except that they are identical (aside from labels) to the metrical and/or syntactic structures that are needed for independent reasons.1 Each pair of sister nodes in a tree is labeled either [h, l] or [l, h], and each new right branch labeled "l" lowers the register by one step. In substantive terms, Ladd's account differs from the Huang/Clements approach primarily in viewing pitch accent in languages like English as being intimately linked to stress patterns and syntactic structure. Pitch realization is determined by the Relative Height Projection Rule (RHPR), which is paraphrased below (HTE = Highest Terminal Element): (1)
Relative Height Projection Rule: In any metrical tree or constituent, the HTE of the subconstituent dominated by l is one register step lower than the HTE of the subconstituent dominated by h, iff the l subconstituent is on the right branch.
(1) assigns sequences of integers to tonal strings in much the same way as do the earlier metrical proposals. These integers define register steps, rather than F0 directly. Thus the integer 3, for example, indicates that a unit has been shifted down in register three times. One difference between Ladd's account and the earlier metrical models of tone is that constituents labeled [l h] do not define an upstepping pattern, but rather absence of downstep. To see this, compare the step pattern assigned to a ternary right-branching structure by Ladd's rule, and by the corresponding rule given by Clements (1983: 167): "add an increment of 1 to each tone for each l dominating it in the tree." These patterns are compared in (2); the initial value for a nonstepped H tone is set here at 0:
(2) [Tree diagrams not reproduced: right-branching structures with h/l-labeled sister nodes, showing the sequences of register steps (0, 1, 2, ...) assigned to their H tones by the two rules.]
We observe that both of Clements's upstepping sequences (in the two central figures) correspond to plateaus in Ladd's outputs. In fact, Ladd treats rises in pitch between two H accents very differently from falls in pitch. Rises between H accents are accounted for in terms of a prominence factor, p, which assigns reduced prominence (undershoot of the F0 target) to the first of two accents. This is used, for example, in accounting for the upstepping accents in examples such as his (5). One wonders, however, why he does not choose the more straightforward solution of accounting for them directly in terms of reversed ([l, h]) labeling, especially since phrase-internal pitch variation is treated elsewhere in Ladd's model in terms of register shift, rather than gradient local variability (as in many other approaches). This brief summary of Ladd's position is sufficient to give an indication of the central role of register in his model, and also suggests certain questions. First, given the fact that intonational trees are built upon independently-needed metrical and syntactic structures, to what extent do they play a crucial role in accounting for linguistic generalizations? Could the metrical/syntactic trees themselves be sufficient to account for nonlocal dependencies of the sort observed in Ladd's experiments? And second, given the fact that Ladd's realization rules recognize a gradient factor of prominence in any case, couldn't this factor, if suitably constrained, take on the entire burden of accounting for the type of F0 variability discussed here without requiring reference to metrically-defined register?
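Before taking these questions up, the contrast just noted between upsteps and plateaus can be made concrete with a small sketch. It is my own illustration, with an assumed tree encoding and my reading of the two rules, not the authors' notation: Clements's rule counts every dominating l, while Ladd's RHPR (as read here) lowers the register only for an l on a right branch.

    # Illustrative comparison of the two step-assignment rules (my encoding).
    # A tree is a terminal "H" or a tuple (label_left, left, label_right, right).

    def steps_clements(tree, n=0):
        """Clements (1983): add one step for every dominating l."""
        if isinstance(tree, str):
            return [n]
        ll, left, rl, right = tree
        return (steps_clements(left, n + (ll == "l")) +
                steps_clements(right, n + (rl == "l")))

    def steps_ladd(tree, n=0):
        """Ladd's RHPR (as read here): only an l-labelled RIGHT branch lowers."""
        if isinstance(tree, str):
            return [n]
        ll, left, rl, right = tree
        return steps_ladd(left, n) + steps_ladd(right, n + (rl == "l"))

    # A right-branching structure containing one reversed (l-h) node:
    tree = ("h", "H", "l", ("l", "H", "h", ("h", "H", "l", "H")))
    print(steps_clements(tree))   # [0, 2, 1, 2] - an upstep (2 -> 1) mid-string
    print(steps_ladd(tree))       # [0, 1, 1, 2] - a plateau instead of the rise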
Let us consider these questions in turn. Ladd does not give explicit rules for labeling trees, but it is crucial that the h/1 labeling must not correspond to the s/w labeling of metrical stress trees at least in some cases, since otherwise the two systems would be isomorphic and the metrical structures alone would be sufficient for pitch interpretation. But in many cases the two types of trees seem to have the same structure, and corresponding labeling. For instance, consider the example Allen is a stronger campaigner (which forms a portion of the longer sentence discussed in the context of experiment 1), in which Allen has the highest pitch. The registral tree needed to describe this right-branching structure is given in (3a), where the numbers are register steps, as discussed earlier. A plausible metrical tree for this sentence is given in (3b). (3)
a. Registral tree over Allen is a stronger campaigner, assigning the register steps 0 1 1. [Tree not reproduced.]
b. Metrical tree over the same sentence, with s/w labeling. [Tree not reproduced.]
If we assume that greater metrical prominence is realized (in part) by higher F o , the tree in (3b), by itself, provides an adequate input to the pitch interpretation rules.2 The registral tree in (3a) does not seem to be adding any crucial information. It is not clear whether a similar analysis can be given of all Ladd's examples, but the matter clearly deserves some attention. This question is independent of the second one - whether the observed register shift effects could equally well be treated in terms of local prominence. In Ladd's account, register shift involves a global resetting of the frequency band available for the realization of H and L tones in the register span affected. Thus, as his figures (3.2) show, we expect register shift to affect all values in the span. But Ladd's examples of register shift do not have this character. Experiment 1, for instance, compares the pitch of the HTE of clauses B and C in sentences of the type (4)
a. [[A and B] but C]
b. [A but [B and C]]
Ladd argues that, given these bracketings, we would expect the peak of clause B to be scaled higher in (4b) than in (4a), and the peak of C to be scaled higher in (4a) than in (4b), since clauses following but are attached higher in the tree. This result is verified for at least one subject. However, as Ladd himself points out, the
first of these predictions does not follow from the RHPR, which assigns the following pattern of register steps to the HTEs of each phrase: (5)
[Diagrams (5a) and (5b), showing the register steps assigned by the RHPR to the HTEs of the clauses under the two bracketings, are not reproduced.]
Ladd sets this discrepancy aside as a problem for further research, but it seems crucial to the argument. Moreover, and of particular importance to the point under discussion, if there were a registral distinction between the B and C conjuncts across both bracketing conditions, we would expect it to affect all accents in each conjunct. Yet as Ladd's figure (3.3) shows, only the pitch of the initial pitch-accent (the HTE) in each one is affected: the second and third remain constant across the two conditions. This seems exactly the sort of effect we might expect if local prominence were involved, determined by relevant aspects of the syntactic configuration.3 It seems quite possible that the F0 dependencies observed in the other experiments can also be interpreted in terms of local prominence (this is in fact Ladd's interpretation of the results of experiment 2). A further interesting question is how Ladd's model would extend to the sort of tone language data represented by Hausa, in which register shift seems primarily conditioned by the pattern of H and L tones, rather than by stress and syntax. Perhaps Ladd would consider these two systems to be typologically distinct, requiring different formalisms: thus register trees in systems like English would be fitted over syntactic/metrical trees, while in Hausa they interpret the tonal string alone. If this is the correct interpretation - and I may be going beyond Ladd's intentions here - it would make the interesting prediction that we should not find "mixed" intonational systems in which register shift is conditioned by stress and/or syntax as well as by the tone pattern. Ladd is cautious about making strong claims based on his current experiments, but his work clearly demonstrates the interest of looking more closely into the link between syntax, metrical structure and F0 assignment in both accentual systems and tone systems, an area which has been badly neglected up to now. Moreover, a system such as Ladd's might offer several real advantages compared to earlier, nonhierarchical approaches. Such approaches, in attempting to express downward pitch trends across the phrase without reference to "global" notions such as register, often chose to multiply distinctions among various types of tones or pitch-
accents, making use of diacritic features to distinguish those that induce downstep from those that do not. It is a clear advantage of register-based models such as Ladd's that they allow a simpler and less arbitrary inventory of basic tonal or pitch accent types. Furthermore, Ladd's cautionary remarks against the use of unconstrained prominence-assigning rules as a "wild card" in intonational theory are well taken, and lead us to hope that a register-based model will be able to impose appropriate constraints on the scope of rules of this type. It is particularly welcome to find Inkelas and Leben introducing tone language data into the debate on F0 models. Phonologists have had great difficulty agreeing on the correct way to represent intonation patterns in stress/accent languages like English, but can often come close to total agreement on the identity of the tonal strings characterizing words and phrases in tone languages. For this reason tone languages give us a privileged vantage point for factoring out the particular contribution of intonational rules to F0 contours. In view of this observation, it is all the more surprising that hard phonetic data on F0 in tone languages are still very scarce. One hopes that pioneering projects such as Inkelas and Leben's will lead the way for others in the near future. Inkelas and Leben propose that there is a phonology of intonation separate from the phonetics. In their view, treating registral features such as downstep as phonological rather than phonetic allows us to capture linguistic generalizations that could not be captured otherwise, and reduces the size and perhaps the complexity of the phonetic implementation component. They propose to account for phonological aspects of intonation in part through the introduction of a register tier, drawing on an earlier proposal by Larry Hyman (1986). In their view, tonal representation in languages with register shift phenomena involves two tonal tiers, termed the primary tier and the register tier, both containing the same elements, H and L tones, linked under a common tonal "class node." Downstep is represented by assigning a L tone on the register tier and spreading it to all following tones that are realized in the same register. Thus, for example, a sequence LH !LH (where the ! indicates downstep) is represented as follows: (6)
Primary tier:    L    H    L    H
Register tier:             L   (linked to the downstepped tones and spread rightward)
Upstep (return to a higher register) is indicated by H tones on the register tier. In this model, then, register lowering is phonologically characterized in terms of L tones on the register tier. By itself, however, this representational system does not determine any particular phonetic output. Register tones, like all phonological
features, require an interpretation at the phonetic level - this is just the function of the implementation rules. Inkelas and Leben have little to say on exactly what implementation rules look like, and so we must speculate on exactly how representations like (6) are mapped into actual F0 contours. It would not be unreasonable to assume that there is a "default" register setting, set toward the top of the range for any given speaker. In this case, the obvious interpretation of the L register tone would be "lower the register by a certain amount."4 Now let us consider how the corresponding register lowering rule would be stated in a model similar to Inkelas and Leben's, but which does not make use of the register tier. The necessary rule would be the same in form - "lower the register by a certain amount" - but it would interpret primary L tones rather than register L tones. But since primary L tones and register L tones are in one-to-one correspondence, this rule will affect the tonal string in exactly the same way. It seems, then, that the register tier has not added any information crucial to phonetic implementation, in this case. Consider now the role of the register tier in accounting for certain features of question intonation. In the final phrase of questions, downstep is suspended, and each H tone is assigned a higher F0 value than it would have in the corresponding declarative (even if we abstract away from the independent effects of Global Raising, which resets questions to a higher register throughout). Inkelas and Leben account for this local suspension of downstep by a rule of Question Raising which inserts a register H over each primary H. This register H is interpreted as shifting the register up (see note 4). The effect of the register Hs is to cancel the effect of each preceding register L. Thus in a tone sequence of the form LHLH..., each L is realized in a lowered register (call it register n), and each H is realized in a higher register (call it register m). As each H tone is produced in a single register (register m), and each L is produced in a single register (register n), neither the Hs nor the Ls are assigned declining F0 values. This mechanism produces the effect of suspension of downstep, as we require. It would be straightforward to mimic the effect of this analysis in a model without a register tier by allowing Question Raising (upward shift in register) to be triggered by primary Hs directly, rather than through the intermediary of the register Hs. This analysis would account for the suspension of downstep (and the resulting complementarity of downstep and Question Raising) in just the same way that the register tone analysis does. Inkelas and Leben suggest, however, that such an approach will not extend to a further generalization. Certain emphatic particles in Hausa, termed ideophones, are always pronounced with an extra-high tone, in statements and questions alike. However, occurring finally in questions they have the same F0 value as the extra-high tones produced independently by the combination of Question Raising and a further rule mentioned in 7 and 8 (which I will term Final Raising), which assigns extra F0 prominence to the final H tone of a question. Inkelas and Leben point out that on the register tier analysis, this
lack of contrast is predicted, since the formal representation of the H tones of ideophones in questions is the same as that of any other final H tone element in questions: a primary H linked to a register H. The system allows no further contrasts. The crucial assumption here is that an implementation rule treatment could not predict this effect in a straightforward way: if several implementation rules assign a shift in register to the same tone-bearing unit, these shifts should be cumulative in effect. However, the formulation of Final Raising appears to be crucial to the argument. According to Inkelas, Leben, and Cobler (1986), this rule raises "a phrase-final High-toned syllable linked to a register H essentially to the top of the speaker's register," where "register" is apparently to be understood as "range" in our sense (since H tones are already at the top of the register). This description is consistent with the data, which show that the final H rises to values around 260 Hz (figure 25), which are well at the top of the speaker's range. If this formulation is correct, however, any additional instruction to raise the register would be without effect, since the top of the range has already been reached. Therefore, we need not appeal to register tones to account for the noncumulative effect of raising, since this result follows from independent considerations.5 In adopting an autosegmental formalism for the description of register, Inkelas and Leben invoke an extremely rich notational system. A further way to demonstrate the need for the register tier would be to show that the full power of this formalism is required to account for certain linguistic generalizations. For example, we could try to discover instances of spreading tones, contour tones, floating tones, tone melodies, and other familiar diagnostics of autosegmental representation involving the register tier. It is perhaps significant that none of this power seems to be required to account for the phenomena discussed by Inkelas and Leben. For example, although they define register spans by spreading register tones rightward, this spreading does not seem to be crucial to pitch implementation.6 Where they seem clearly correct is in their belief that register - whatever its formal characterization - is a fundamental concept in tonal analysis. In the following section I will offer some ideas of my own on how register can be formally characterized in intonation theory. The view I will present is a componential one, in which phonetic F o values result from the interaction of a variety of partly independent factors. These can be summarized as follows: (7)
Some factors in F0 assignment:
i. Prosodic units: tone groups, intonational phrase, etc. (determine the domains within which tone rules apply)
ii. Melody composition (nature of the tone melody itself, including lexical, grammatical and boundary tones)
iii. Register effects (affects tone sequences):
   a. downstep, upstep (register shift)
   b. global raising (affects the entire intonational unit, as in Hausa questions)
iv. Declination: the steady, time-dependent decay in F0, independent of the composition of the tone melody and of register effects
v. Local F0 adjustment (affects individual tones):
   a. tonal assimilation and dissimilation (phonologically conditioned)
   b. domain-edge adjustments (conditioned by prosodic boundaries)
   c. expressive raising (conditioned by focus, contrastive emphasis, etc.)
The question I would like to consider here is how the factors grouped under (7iii-v) can be treated in an intonational model. I will suggest a strategy of F0 assignment that provides a formal distinction among the mechanisms for treating register, declination, and local F0 effects. This strategy provides direct recognition of the fact that surface F0 contours in tone languages may result from an overlay of all three of these mechanisms. In this approach, the full power of autosegmental or metrical formalisms is not called upon. For example, we may be able to require that the direction of falling (or rising) register ramps is constant throughout certain well-defined prosodic domains, such as the tone group, as has been widely assumed in much of the earlier descriptive literature on African tone systems.7 In my discussion of Ladd's experimental data, I have suggested that some of the apparent registral effects that raise a problem for such a claim might be alternatively treated in terms of local F0 adjustment. I have similarly suggested that Inkelas and Leben's treatment of question intonation in terms of alternating rises and falls in register might be treated as involving a single, unchanging register (note 5). While these suggestions are necessarily tentative, they can easily be tested. The alternative approach develops out of work that I have carried out in collaboration with Elizabeth Leung (Clements and Leung 1986), stimulated in part by Leung's work on Llogoori, a Bantu language in which downstep is triggered by each new occurrence of a H tone (Leung 1986). A Llogoori sentence, constructed in order to exhibit the recursive nature of downstep, is given in (8). This example involves three downward shifts in register, after each of the H tones. The first H tone is produced at about the same level as a sentence-initial H tone in other sentences, and we take this as evidence for assuming that a LH sequence belongs to a single register span. Within each of the two longer H-tone spans, H tones show a more gradual drop. Other sentences show a similar effect for L-tone spans, and we take this as provisional evidence for a process of declination, applying as a constant function across the sentence independent of the identity of the tonal string. The effect of register drop (downstep) between two H tones is rendered less dramatic by the concurrent declination, but is perceptibly greater than the F0 drop that would be expected from declination alone, and is entirely predictable from the nature of the tonal configuration.
(8) "we've just caused (somebody) to eat very slowly"
[Figure omitted: F0 trace (approximately 170-250 Hz) of this Llogoori sentence, showing the H- and L-tone spans and the register drop after each H tone.]
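To make the reasoning concrete, the following is a minimal numerical sketch (mine, not Clements and Leung's) of how a constant declination component and a register (downstep) component can be kept separate; the declination rate, downstep ratio, and F0 values are invented for illustration only.

```python
# Hedged sketch: separating a time-dependent declination component from a
# register (downstep) component in the F0 of successive H tones.
# DECLINATION_HZ_PER_S and DOWNSTEP_RATIO are assumed example values,
# not measurements from Llogoori.

DECLINATION_HZ_PER_S = 15.0   # constant decay applied across the whole sentence
DOWNSTEP_RATIO = 0.85         # register is scaled by this factor at each downstep

def predicted_h_tone(initial_hz, elapsed_s, n_downsteps):
    """F0 predicted for a H tone `elapsed_s` seconds into the sentence,
    dominated by `n_downsteps` register shifts."""
    register = initial_hz * (DOWNSTEP_RATIO ** n_downsteps)
    return register - DECLINATION_HZ_PER_S * elapsed_s

# Drop expected from declination alone vs. declination plus one register drop,
# 0.5 s after an initial H at 240 Hz:
print(round(predicted_h_tone(240.0, 0.5, 0), 1))   # declination only
print(round(predicted_h_tone(240.0, 0.5, 1), 1))   # declination plus downstep
```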
The model treats registral and declinational processes in terms of global functions, and local processes in terms of local rules. Registral structure is expressed in terms of uniformly right-branching and right-headed trees. These trees bear more than a superficial resemblance to metrical trees, but there is normally no reason to treat them as pervasive in phonological representations. That is, phonological rules do not access them or act upon them (but see note 7), nor do they have to be accessed by pitch interpretation rules at any point other than surface representation. They can be regarded, if necessary, as part of the rule application algorithm much as has been suggested for segmental metrical rules by Poser (1982): they are erected purely for the purposes of F0 assignment, and cannot be referred to by rules of other types. A schematic view of the way in which register and local F0 effects interact in this model is given in (9).

(9) [Diagram omitted: registral structure shown as a right-branching tree over tonal feet, with each tone annotated for its basic tone value (BTV), downstep factor (DF, powers of d), local increment (LI, p), and derived tone value (DTV, to be computed).]
Each register span is gathered into a tonal foot, and tonal feet are grouped into right-branching trees. This structure provides the necessary input to the principles and rules assigning F0 contours.

The basic tone value (BTV) is the idealized invariant value assigned to each distinct tone level, here H and L. These are the values at which each tone would be realized in the absence of any other factors influencing its realization, and correspond to the values these tones normally assume in the initial span of declarative utterances, where they are not influenced by register shift. I leave open the question of whether this value should be given in terms of F0 or some more abstract, speaker-independent unit.

The downstep factor (DF) expresses the amount by which a given tone is lowered from its basic value. This is expressed provisionally as a fraction, d, whose values range between 0 and 1, and which is raised to a power x identical to the number of downsteps dominating it.8 This treatment of downstep scaling is inspired by the rules for downstepping accents in English given by Pierrehumbert (1980: chapter 4). It allows us to express the fact that a sequence of downstepped H tones will fall toward an asymptote located well above the bottom of the speaker's range. Based on our initial observations, this may be correct for Llogoori, and seems to fit Inkelas and Leben's data for Hausa, but requires more systematic verification.

Local increments (LI) are assigned by rules of the sort listed in (7v). For example, the increment p in (9) might be introduced by a rule assigning extra F0 prominence to the last H tone before a L, as is often observed in African tone languages (see e.g. Tucker and Creider 1975 for data from Luo). Notice that such rules apply only locally and have no consequences for the tones following them. I provisionally assume values for p to be absolute increments or decrements, although I know of no evidence against regarding them as ratios.

The derived tone value (DTV) of any tone is a function of the value it has received for each of the components BTV, DF, LI, and is defined by the following equation: (10)
DTV = (BTV × DF) + p1 + p2 + ... + pn
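As a rough illustration of how (10) combines the three components, here is a minimal sketch; it is my own, not part of the paper, and the basic tone value, downstep fraction d, and increment p are invented numbers.

```python
# Hedged sketch of the derived tone value (DTV) computation in (10):
# DTV = (BTV * DF) + p1 + ... + pn, with DF = d**x for a tone dominated
# by x downsteps.  All numeric values below are invented for illustration.

def derived_tone_value(btv, d, x, increments=()):
    downstep_factor = d ** x                    # DF shrinks geometrically with x
    return btv * downstep_factor + sum(increments)

BTV_H = 200.0   # assumed basic tone value for a H tone (units left open in the text)
d = 0.8         # assumed downstep fraction, 0 < d < 1

# Successive downsteps scale the H tone by d**x, so the values fall geometrically:
print([round(derived_tone_value(BTV_H, d, x), 1) for x in range(4)])

# A local increment p (e.g. extra prominence on the last H before a L)
# affects only the tone it is assigned to:
print(derived_tone_value(BTV_H, d, 1, increments=(10.0,)))
```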
Naturally this is a simplified presentation and does not yet, for example, take declination into account. Nevertheless, a model incorporating the properties of (9) has certain advantages, potential and real. First, by treating register, declination and local F0 adjustments as independent variables, it expresses the fact that these may vary independently under different linguistic conditions and across different speakers and languages. Second, it is a more constrained model than one that is similar in other respects but which invokes the full power of autosegmental or metrical frameworks. It does, like most other models, invoke the power of p, a local, gradient variable, and it remains to be seen how its use can be suitably constrained; I have suggested in (7v) that it can be introduced under just three types of conditions, and therefore does not have the status of a completely unconstrained "wild card." Third, by factoring out the BTV as a separate
variable, this model extends in a straightforward way to the description of more complex tone languages, such as those with three, four, or five tone levels, some of which are discussed in Clements (1981). Fourth, and more speculatively, the present system would be quite compatible with the existence of languages in which tonal and syntactic/metrical considerations could play a joint role in determining register shift phenomena. At present, I know of no evidence bearing on this question. My aim in these remarks has been to examine the nature of register in the light of the empirical studies by Ladd and by Inkelas and Leben and to raise the question of how it should be characterized in a formal theory of intonation. I have suggested that while it may not be necessary to resort to the full power of autosegmental and metrical notations to account for registral effects, register nevertheless plays a central role in accounting for F o contours, and requires separate formal expression in intonation theory. I hope these comments will serve the purpose of bringing some of these issues into sharper focus.
Notes
1 Thus Ladd's model requires no special rules of tree construction, placing a strong constraint on the expressive power of his theory.
2 I follow Pierrehumbert (1980) in assuming that normal [s w] node labeling may be reversed at any level of the tree to highlight particular information in the phrase. In one respect the tree in (3b) provides a superior input to the rules assigning F0, in that it characterizes campaigner as more prominent than stronger (see Ladd's example 3).
3 It is possible that it is not depth of embedding, but the lexical choice between and and but that accounts for the differences between these examples. One way to test for this alternative while controlling for purely syntactic differences would be to test pairs of conjuncts of the form [A and B] vs. [A but B]. Alternatively one could control lexical choices while varying the type of nested structure, for example by comparing homophonous sentences differing only in phrase structure such as the woman hit the man with a camera, or sentences in which different phrase structures are imposed by different, but semantically neutral lexical choices, such as Mary wished that John would love him with all his/her heart.
4 This interpretation seems to be in accordance with the intentions of the authors. In a preliminary version of this paper, Inkelas, Leben, and Cobler (1986) state: "we interpret each register tone literally as setting up a new register for subsequent tones, and... we interpret High register as the instruction to go up in F0 by a certain amount and Low register as the instruction to go down by a similar amount."
5 There is a very different way of viewing the suspension of downstep in question intonation, which is to assume that there are no changes of register at all (whether downward or upward) in this context. Thus, for example, Laughren (1984) accounts for the suspension of downstep in Zulu questions by assigning all tones to the same register. If this treatment can be extended to Hausa questions, we would expect ideophones (and emphasized words) to contrast with other words in nonfinal position
in questions (where the rule of Final Raising does not apply). Unfortunately, Inkelas and Leben do not provide discussion of such examples in their paper.
6 Some of these issues, and further arguments for the register tier, have been discussed in independent work by Inkelas (1987).
7 A possible counterexample to this claim (from a dialect of Zulu) is discussed in Clements (1981).
8 Thus we may take the register-step integers assigned by Ladd trees (or Clements/Huang trees) as the value of x, and compute the DTV by assigning the appropriate downstep factor to any tone in any order. Consequently, left-to-right, self-feeding iterative application is not a necessary mode of application for F0 assignment rules. The system outlined in Ladd's appendix (which did not appear in the conference version) adopts a similar treatment of downstep, as far as I can see, and appears to have the same consequence.
References
Clements, G. N. 1979. The description of terraced-level tone languages. Language 55 (3): 536-558.
1981. The hierarchical representation of tone features. In G. N. Clements (ed.) Harvard Studies in Phonology vol. 2. Bloomington: IULC; reprinted in I. R. Dihoff (ed.) 1983. Current Approaches to African Linguistics vol. 1. Dordrecht: Foris, 145-176.
Clements, G. N. and E. Leung. 1986. Downstep without floating tones. Paper presented at the Workshop on Nonlinear Phonology, University of Leiden, June 1986.
Huang, C. J. 1980. The metrical structure of terraced-level tones. In J. Jensen (ed.) NELS (Cahiers Linguistiques d'Ottawa, vol. 9), Dept. of Linguistics, University of Ottawa, 257-270; expanded as: On the autosegmental and metrical nature of tone terracing. In D. L. Goyvaerts (ed.) 1985. African Linguistics. Ghent: Story-Scientia, 209-238.
Hyman, L. 1986. The representation of multiple tone heights. In K. Bogers, H. van der Hulst and M. Mous (eds.) The Phonological Representation of Suprasegmentals. Dordrecht: Foris, 109-152.
Inkelas, S. 1987. Register tone and the phonological representation of downstep. In I. Haik and L. Tuller (eds.) Current Approaches to African Linguistics vol. 6. Dordrecht: Foris.
Inkelas, S. and W. R. Leben (this volume). Where phonology and phonetics intersect: the case of Hausa intonation.
Inkelas, S., W. R. Leben, and M. Cobler. 1986. The phonology of intonation in Hausa. MS., Stanford, CA: Stanford University (conference draft for Inkelas and Leben, this volume).
Jones, D. and S. T. Plaatje. 1916. A Sechuana Reader. London: London University.
Ladd, D. R. (this volume). The metrical representation of pitch register.
Laughren, M. 1984. Tone in Zulu nouns. In G. N. Clements and J. Goldsmith (eds.) Autosegmental Studies in Bantu Tone. Dordrecht: Foris, 183-234.
Leung, E. 1986. The tonal phonology of Llogoori. Master's dissertation, Cornell University.
Pierrehumbert, J. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT.
Poser, W. 1982. Phonological representations and action-at-a-distance. In H. van der Hulst and N. Smith (eds.) The Structure of Phonological Representations Part II. Dordrecht: Foris, 121-158.
Tucker, A. N. and C. A. Creider. 1975. Downdrift and downstep in Luo. In R. K. Herbert (ed.) Proceedings of the 6th Conference on African Linguistics (Ohio State University Working Papers in Linguistics 20), 125-134.
Yip, M. 1980. The tonal phonology of Chinese. Ph.D. dissertation, MIT.
5 The timing of prenuclear high accents in English KIM E. A. SILVERMAN AND JANET B. PIERREHUMBERT
5.1 Introduction
Speaking English means doing two things at once: saying the words, and saying the melody. The coordination between these two activities is far from arbitrary. Rather, it is determined by the stress pattern and phrasing of the utterance. Some features of the melody fall on certain stressed syllables, while others fall at the boundaries of prosodic phrases. In this paper, we will be concerned with the timing of pitch accents, the melodic features which fall on certain stressed syllables. We shall examine the way in which prenuclear pitch accents are aligned with their associated syllables in a variety of prosodic environments, and our phonetic data will allow us to address two related issues. One of these is the degree of similarity between prenuclear and nuclear pitch accents - one of the longstanding points at which the American and British traditions of intonational analysis diverge. The second issue, which we believe supersedes the former, is the nature and status of the process by which underlying phonological forms are given their surface phonetic realization. From a phonological point of view, the association of pitch accents with syllables can be described using autosegmental links. An example transcription is given in (1). The accents are transcribed using the system of Pierrehumbert (1980), and prosodic structure above the syllable level is omitted.1 (1)
  [mama lem]          phoneme tier
   σ    σ    σ        syllable level in the prosodic structure
   |          |
   H*       H+L*      melody tier
Links between elements constrain them to overlap, as they are produced in time. For instance, the first accent (H*) and the first two phonemes ([m] and [a]) are phonologically specified to occur on the first syllable.
[Figure 5.1 omitted: scatterplot of peak delay (seconds) against vowel length (seconds), all speaking rates, all vowels; speaker RWS, with points labeled by number of postnuclear syllables (0 or 3) and a regression line for each group.]
Figure 5.1 Position of nuclear peak, relative to vowel onset, with either three (3) or zero (0) postnuclear syllables. From Steele (1986).
Such a transcription, in which the coordination between words and melody is mediated through the prosodic structure, is very nonspecific about exactly how segments in the two tiers are overlapped during pronunciation. And this is as it should be, because the phonological representation should capture only linguistic contrasts. Phonetic implementation rules, by contrast, have to specify how the phonological structure is realized in actual speech. Phonetic observations reveal a great deal of variation in how the realization of the pitch accents is coordinated with the realization of the speech segments. Furthermore, this variation is systematic. This point is illustrated in figure 5.1 with data from Steele (1986).2 This study investigated the timing of the fundamental frequency peak for a nuclear H* accent,3 as a function of speech rate and of the number of postnuclear syllables. The figure plots the peak delay (the distance of the peak from the vowel onset of the nuclear syllable) against the length of the vowel. Points labeled "0" are for utterances in which no syllables followed the nuclear stress, produced as the response in: Did John write to Sue? - No, he wrote to NAN.
Points labeled "3" are for utterances in which three syllables followed; these were produced as the response in: Did John write to SUSIE Moore Lane? - No, he wrote to NANA Moore Lane. Regression lines summarize the trend for each of these two cases, as speech rate is varied. Points to the right within each group represent longer vowels, corresponding to utterances spoken at the slower rates. The way the 3's and 0's separate out into two distinct clouds makes clear that the peak is much earlier, relative to the total vowel duration, when no syllables follow than when three do. A related point is that speech rate interacts differently with peak alignment than phrase-final lengthening does. If they interacted in the same way, then the 3's and 0's would be located on top of each other in one cloud instead of two: it would be possible to predict the peak delay from the vowel length alone, without knowing the prosodic configuration. Clearly, this is not possible.

Here, we address two questions raised by Steele's data and other data like them. The first is: are prenuclear and nuclear accents similar or different in the tonal phonology? Theories of English intonation in the British school, such as O'Connor and Arnold (1973), use a different phonological inventory for prenuclear and nuclear accents. Within such a framework, phonetic differences between the two types of accent would be expected as a matter of course, because they would be the surface realizations of an underlying difference. Pierrehumbert (1980), on the other hand, claims that English has the same inventory for both types of accent. Furthermore, a uniform phonetic realization process is claimed to be responsible for explaining how accents in both positions are pronounced. This theory leads one to expect that data similar to those in figure 5.1 would also arise for prenuclear accents as prosodic context is varied. Silverman (1987) in fact questions this aspect of Pierrehumbert's phonology for a number of reasons. A survey of the relevant phonetic studies of F0 contours showed a consistent tendency for nuclear peaks to be aligned much earlier in their syllables than prenuclear peaks. Alignment rules proposed by previous researchers (Mattingly 1966; Pierrehumbert 1981) have treated nuclear and prenuclear positions differently. In his own F0 synthesis model, which incorporated Pierrehumbert's intonational phonology as one of its components, Silverman found that his phonetic implementation rules needed to know whether an accent was nuclear or prenuclear in order to compute how it should align with its associated syllable. The resultant alignment differences were crucial for the quality of the synthetic speech. Taken together, these considerations seemed to bring the underlying unity of prenuclear and nuclear accents into question. This issue is attacked directly in our study, which examined the alignment of accent peaks in prenuclear position. Prosodic context was varied by varying the
proximity of the accent to a word boundary and by varying the number of syllables separating the prenuclear and nuclear accents. Speech rate was also varied. In general, differences similar to those in figure 5.1 emerged, supporting the hypothesis that prenuclear and nuclear accents are represented and realized by a uniform mechanism. The second question is why differences like that in figure 5.1 occur. The observed contrast in alignment is apparently related to the fact that the " 0 " syllables are lengthened because of their prosodic position (i.e. utterance-final), whereas the " 3 " syllables are not. However, the mechanism for this relationship is unclear. Some of the alternatives include: Invariance. The F o rise for a H* accent has an invariant duration for any given speech rate. The position of the peak relative to its syllable varies artifactually when independent factors (such as utterance-final lengthening) alter the segment durations without affecting the rise duration. In this account, observed differences in peak alignment when prosodic context is varied are considered to be consequences of low-level phonetic properties of speech. Gestural overlap. The relatively early peaks observed for the " 0 " points occur because the articulatory gesture for the accent is interrupted by the gesture for the following Low (L) tone which marks the end of the phrase. The relation of the peak placement to the duration pattern arises indirectly, as a consequence of the fact that phrase ends in English are marked by tones as well as by lengthening. Similar effects could be found in prenuclear position when another accent occurs before the gesture for the first is completed. Tonal repulsion. The articulatory gesture for the accent is moved earlier in time, in order for the accent and the following tone to be fully pronounced in the time available. The relation to the duration pattern arises the same way as in the overlap hypothesis. Phonological mediation. Structural prosodic features modify the phonological representation in a way that speech rate changes do not, such as by adding extra beats to the metrical grid. The modified phonological representation causes certain syllables to be lengthened, and also gives rise to a difference in alignment of the pitch accent. Sonority profile. The opening and closing gestures for the syllable give rise to an increase and decrease in sonority (where we define sonority loosely in terms of the overall openness of the vocal tract or the total impedance looking forward from the glottis). The sonority profile, or the time course of sonority for the syllable, differs between lengthening and nonlengthening environments because the closing gesture is more extended by prosodic lengthening than is the opening gesture. The F o gesture for the accent is coupled to the entire sonority profile of the syllable (not just aligned with the vowel onset, as under the invariance theory). The exact form of the coupling interacts with the different 75
effects of rate and prosodic lengthening on the sonority profile, and thereby yields differences in alignment. All five of these theories predict that alignment differences will be found in prenuclear as well as nuclear position, as the prosodic context is varied. They differ among each other in their detailed predictions, and this will make it possible to compare them using data from our experiment. For example, the invariance and phonological mediation theories provide no mechanisms for varying alignment without co-varying duration. According to these accounts, differences in syllablerelative peak alignment must be either mere artifactual consequences of prosodically-induced lengthening (invariance), or else are induced at the same time by the same prosodic triggers (phonological mediation). Undershoot and tonal repulsion, on the other hand, do provide such a mechanism for varying peak placement independently of duration: contexts can be contrived in which the distance between tones varies without varying the structural factors relevant to the duration rules, and effects on peak alignment are predicted in such cases. The sonority profile theory permits contrasts in alignment pattern among syllables with the same overall duration, but only if syllable-internal details of the timing pattern vary. In general, the data supported neither the invariance nor the phonological mediation theories. Invariance failed because different prosodic contexts at the same speech rate showed systematically different absolute distances between the vowel onset and the F o peak (as well as different relative positions of the peak within the syllable). Phonological mediation failed because the results suggested that alignment differences were gradient (ie. continuous) rather than discrete. A particular difficulty for the phonological account arose in one subset of the data, in which a systematic difference in alignment was not related to a difference in duration; the account offers no way of handling this case or of relating it to the rest of the data. The gestural overlap and repulsion theories were somewhat more successful, but still failed to provide a full explanation because the absolute distance between accents was less important than the right-hand prosodic features in determining peak placement. The subset of the data that caused difficulties for phonological mediation tends to support an account based on tonal repulsion, although it could be accommodated within the sonority profile approach if additional assumptions were borne out. A small investigation of syllable-internal timing in fact provided some evidence for an account based on sonority profiles. Our conclusion is that observed alignment differences arise through gestural overlap, through tonal repulsion, through coupling to the sonority profile, or (most likely) through a combination of these three mechanisms. We can step back from these particular alternative formulations to make some observations about how these theories relate to each other. They differ on which components of speech production access which aspects of the phonological 76
representation, and on how they make use of it. The phonological mediation theory represents one extreme. In this approach, the phonetic implementation process does not have direct access to the hierarchical prosodic structure. Rather, the phonetic rules can only refer to an intermediate phonological representation in which those structural characteristics that are responsible for the observed phonetic differences have been explicitly re-encoded. The invariance theory tends in the opposite direction, by removing from the phonological representation the burden of generating differences in peak alignment. It shares with the phonological mediation theory the view that the rules responsible for producing the F o contour for any pitch accent do not themselves access an utterance's hierarchical structure, but it does allow the duration rules to do so. This approach thereby somewhat enhances the status of phonetic implementation processes in explaining the sound patterns of speech. Gestural overlap is similar to invariance in this regard, although it takes a less simplistic view of the relationship between the underlying tonal sequence and its surface realization in a particular utterance. Tonal repulsion requires the accent realization rules to access the hierarchical prosodic structure in a particularly powerful and complicated way. Instead of the alignment varying according to structural features in the right-hand context, it requires a computation of their detailed phonetic consequences — an estimate of how long in absolute time units until the next tonal element - before any accent can be pronounced. The account in which F o peak alignment is related to the sonority profile lies somewhere between the above extremes. The form of the sonority profile for a syllable is influenced by factors which can be directly read from the hierarchical prosodic structure, without any need for an intermediate level of representation. Accent placement is coupled to the overall shape of the sonority profile, and so the same contextual factors which give rise to prosodically-induced lengthening also determine peak placement. In what follows, we shall consider our phonetic data in the light of the above alternative formulations. We shall present evidence that phonetic implementation rules for F o and duration must directly access an utterance's hierarchical prosodic structure and use this information to compute the alignment of accents with their associated syllables.
5.2 Method
5.2.1 Materials
Two adult speakers took part in the experiment: one male (RWS - the same speaker whose data for peaks in nuclear position are shown in figure 5.1) and one female (JBP). Each speaker produced names of the form Ma Lemm, Mom Le Mann, Mamalie Lemonick, and Mama Lemonick, with all twelve combinations of
[Figure 5.2 omitted: F0 contour (Hz) over time (seconds) for Mamalie Lemm, speaker JBP, with the tonal transcription H* H+L* L L%, measurement lines 0-4, and two alternative peak locations marked.]
Figure 5.2 Extracted F0 contour for Mamalie Lemm, spoken by JBP at the slow rate. The vertical lines numbered 0 through 4 represent measurement points for the segmental durations. Line number 5 is the chosen peak location, midway between the two possible candidates labeled with arrows.
the four first names and three surnames.4 Names possessing entirely voiced sonorant consonants were used in order to minimize segmentally-induced perturbations on fundamental frequency. These names were chosen in order to vary the rhythmic configuration between the two syllables with word stress. The names differ in the number of syllables separating the stresses (0 to 3) and in the location of the word boundary with respect to the stresses. In some of the names (those beginning with Mamalie and Mama), the accented syllable does not immediately precede a boundary. In two (Ma Le Mann and Mom Le Mann), it is lengthened because it is the last syllable before the word boundary. In the remaining two (Ma Lemm and Mom Lemm), it is lengthened by two effects - it precedes a word boundary and it participates in a stress clash.5 Subjects used a H* H+L* melodic pattern, which was illustrated by example to subject RWS. This pattern has a peak associated with the prenuclear stress, followed by a plateau and a step down onto the syllable with nuclear stress. The utterance ends with a low pitch, transcribed as the sequence L L%. An example is given in figure 5.2. This pattern was selected because it offers a conspicuous, measurable F0 feature in prenuclear position, and because it seemed to encourage productions free of major phrasing breaks.
Each speaker produced, in randomized order, five repetitions of each combination in each of three speaking rates (fast, normal and slow), yielding altogether 360 utterances. Randomization was carried out in blocks, to control for practice and fatigue effects within the recording session.

5.2.2 Measurements
Measurements were made of the segmental durations and the F0 maximum corresponding to the prenuclear H*, using a computer F0 track and waveform display shown in figure 5.2. In Ma, Mom, and Mama, the durations of all segments were measured. In general, we believe that the time points for transitions between /m/ and vowels are quite accurate: enlargement of the relevant section of the acoustic waveform clearly revealed the boundary between the smooth glottal periods belonging to the interval of lip closure and the glottal periods showing higher frequency components (related to the formant structure) that appeared as soon as the lips were (even slightly) open. In addition, the waveform typically had a higher amplitude during the vowels (see measurement lines 1, 2 and 3 in figure 5.2). In Mamalie, the sequence 'alie' was measured as a unit, since the unstressed [l] could not always be so reliably segmented. The onset of the initial [l] of the surname was also somewhat problematic, especially when it was unstressed or followed an [m]. However, it was found fairly reliably by examining local perturbations in the autocorrelation and amplitude contours, by playing speech fragments, and by examining the formant transitions in computer-generated spectrograms.

The time point for the F0 maximum is probably the least reliable of the measurements. Segmental effects from the nasal and irregularities in the pitch made its location rather uncertain in many cases. Where there were two alternative locations that could be construed to be the relevant F0 peak, we took the average between the two. The example F0 contour in figure 5.2 illustrates one of the more uncertain cases: line 5 marks the chosen measurement point.

Of RWS's utterances, 28% had to be omitted from the data set because the speaker produced the wrong intonation pattern or because of articulation errors. After the initial measurements were taken and plotted, all utterances whose peak placements represented extreme outliers in the data set were individually examined. Values that were found to arise from measurement errors were corrected. This enterprise was conducted in a theoretically unbiased fashion; whenever an outlier was examined in a measurement set, another outlier which deviated from the trend in the opposite direction was also examined.
5.3 Statistical Analysis
In analyzing the data, we wished to look at the relationship between the structural characteristics of the utterances and the pitch accent alignment. A useful tool for modeling and statistically assessing such relationships is Multiple Regression.6 Applying this to the present results, we attempt to "predict" the location of the
Table 5.1 Coding of prosodic characteristics for all twelve names

Name                 wb    sc
Ma Lemm               1     1
Mom Lemm              1     1
Mama Lemm             0     0
Mamalie Lemm          0     0
Ma Le Mann            1     0
Mom Le Mann           1     0
Mama Le Mann          0     0
Mamalie Le Mann       0     0
Ma Lemonick           1     1
Mom Lemonick          1     1
Mama Lemonick         0     0
Mamalie Lemonick      0     0
F o peaks on the basis of other characteristics of the utterances. If there is a systematic relationship between the characteristics we choose and the alignment of the peaks, then the predictions will account for a significant amount of the variation in the data. The properties that we choose can be either quantitative or qualitative. In the latter case we represent a dichotomous feature as a binary variable: to represent the presence of word boundaries and stress clashes at the right-hand edge of the prenuclear stress, we have constructed two such variables; wb and sc. Each utterance receives a score of either 1 or 0 on each of these variables, depending on whether or not the corresponding feature is present. Hence for Ma Le Mann, which has a word boundary but no stress clash, wb = 1 and sc = 0. Table 5.1 lists these values for all twelve names. In a similar way, the three speech rates can be encoded in a further two variables: fast and slow. The variable fast is assigned a value of 1 for all utterances spoken at the fast speaking rate, and 0 otherwise. Similarly, slow is assigned 1 for each slow utterance, and 0 otherwise. Utterances spoken at the normal speaking rate do not require a separate variable, since they are already uniquely specified by virtue of being neither fast nor slow and so having a value of 0 for both variables. The values of the variables fast and slow for each speech rate are summarized in table 5.2. Relationships in the data can be evaluated by expressing them in algebraic (linear) equations using the above variables. We illustrate the technique here with a simple model in which we attempt to account for the peak placements in terms of speech rate alone. (The invariance hypothesis predicts that this should provide a good model of the data.) The equation is: (2)
peak delay = a + b.fast + c.slow where peak delay is the distance from the start of each vowel to its Fo peak, in milliseconds.
Table 5.2 Coding of speech rate

Speech rate    fast    slow
slow            0       1
normal          0       0
fast            1       0
Table 5.3 Multiple Regression coefficients for prediction of absolute peak delay (in milliseconds) on the basis of speech rate alone, from equation (2)

Speaker     a      b     c      R2
JBP       151    -38    56    43.6%
RWS       109    -24    22    25.9%
Multiple Regression calculates the three coefficients (a, b and c) that give the best fit to the data, and thereby enables us to assess how much of the variance in the dependent variable (in this case peak delay) can be accounted for by the independent variables (in this case fast and slow, which jointly represent speech rate). This proportion, known as R2, will here always be reported as a percentage of the overall variance in the dependent variable.7 Table 5.3 gives the coefficients and R2 values for this model. These results predict that (for JBP) the peak delay in utterances spoken at the middle speech rate will be on average 151 ms. In the fast utterances peaks will be placed on average 38 ms. earlier, and in the slow utterances they will occur on average 56 ms. later (i.e. peak delay will be 151—38 = 113 ms., and 151 + 56 = 207 ms., respectively). We emphasized on average for a reason. Although the signs of the coefficients in table 5.3 indicate that the relationship between speech rate and peak placement was in the same direction for both speakers, the R2 values in the table indicate that speech rate alone accounted for less than half of the variation in JBP's data, and only about a quarter of the variation for RWS. This model is not particularly successful because it ignores any influence of the prosodic features (the wb and sc variables were not included in the equation). We shall show that the effects of these variables were systematic, comparable in magnitude (but different in nature) to the effects of speech rate, and therefore are a necessary component of any model of peak placement.
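To make the mechanics of this kind of model concrete, the following is a minimal sketch of fitting equation (2) by ordinary least squares using the 0/1 coding of table 5.2; the observations are fabricated purely for illustration, and the use of numpy is my own choice rather than part of the authors' analysis.

```python
# Hedged sketch of the dummy-variable regression in equation (2):
# peak_delay = a + b*fast + c*slow.  The observations below are fabricated
# examples, not measurements from the experiment.
import numpy as np

# (fast, slow) coding: slow = (0, 1), normal = (0, 0), fast = (1, 0)
rates = [(0, 1), (0, 0), (1, 0), (0, 1), (0, 0), (1, 0)]
peak_delay_ms = np.array([205.0, 150.0, 115.0, 210.0, 148.0, 110.0])

X = np.column_stack([np.ones(len(rates)),        # intercept a
                     [f for f, s in rates],      # fast
                     [s for f, s in rates]])     # slow
coef, _, _, _ = np.linalg.lstsq(X, peak_delay_ms, rcond=None)
a, b, c = coef

predicted = X @ coef
r_squared = 1 - np.sum((peak_delay_ms - predicted) ** 2) / np.sum(
    (peak_delay_ms - peak_delay_ms.mean()) ** 2)
print(a, b, c, r_squared)   # fitted coefficients and proportion of variance explained
```

The same machinery extends directly to equation (3) below by adding the wb and sc columns of table 5.1 to the design matrix.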
5.4 Results

5.4.1 Upcoming context affects peak placement
The locations of the F0 peaks, relative to the onsets of the associated vowels, showed considerable variation in the productions of both speakers. Figures 5.3a and 5.3b show that prosodic context determines the alignment of peaks with the accented syllables, in a way that is qualitatively different from the effects of speech rate. In these figures we have plotted the data for those names that correspond most closely to the subset of Steele's data which we presented in figure 5.1. Points labeled "3" represent the utterances where three unstressed syllables separate the prenuclear and nuclear accents (i.e. Mamalie Le Mann), while points labeled "0" indicate no intervening syllables (Ma Lemm and Mom Lemm). The horizontal axis represents the length of the rhyme in the syllable bearing the prenuclear accent (the /om/ in Mom and the /a/ in Ma and Mamalie). The delay from the start of the rhyme to the F0 peak is plotted on the vertical axis. Regression lines summarize the trend for each of the two prosodic structures. The data exhibit a contextually-governed variation similar to that shown by Steele for nuclear pitch accents. Points for each name fan out into an elongated cloud, with the points further from the origin in each cloud representing slower utterances. The important pattern in the plots is that the separate clouds for each name are almost completely nonoverlapped. Within the cloud of points for each name, the relationship between peak delay and vowel length can largely be modeled by a line with a positive slope: when a vowel is lengthened because the utterance was spoken more slowly, the peak is correspondingly delayed. In contrast to this, the difference between the clouds shows that when a syllable is lengthened because of the upcoming prosodic context (as in MA Lemm), the peak is aligned much earlier in the syllable.

Comparisons between other names in our corpus show that the difference in peak alignment between Mamalie Le Mann and Mom Lemm arises as a result of two contributing factors. Figures 5.4a and 5.4b illustrate that one of these is word-final lengthening. The plots compare the peak alignment in Mamalie Le Mann ("3") with the alignment in Ma and Mom Le Mann ("1"). In the latter names, the accented syllable is word-final and therefore longer, but the peaks occur earlier in the syllable rhymes. The second factor that contributes to early peak alignment is the effect of a stress clash. Figures 5.5a and 5.5b compare the data for Ma and Mom Lemm, in which the first syllable is lengthened by a stress clash as well as a word boundary, with Ma and Mom Le Mann, in which the first syllable is lengthened by a word boundary alone. In the former case we see that when a stress clash is present, syllables are even longer but peaks are again relatively earlier. These results show a systematic effect of the right-hand prosodic context on peak alignment, but they might seem to suggest a simpler explanation; namely that
[Figures 5.3a and 5.3b omitted: scatterplots of peak delay (seconds) against rhyme length (seconds), with regression lines, comparing Mamalie Le Mann (3) with Ma and Mom Lemm (0).]
Figures 5.3a (JBP) and 5.3b (RWS) Peak delay, relative to the onset of the vowel, as a function of the length of the syllable rhyme (/a/ in MA and MAmalie, /om/ in MOM). "3" = Mamalie Le Mann, "0" = Ma and Mom Lemm.
[Figures 5.4a and 5.4b omitted: scatterplots of peak delay (seconds) against rhyme length (seconds), with regression lines, comparing Mamalie Le Mann (3) with Ma and Mom Le Mann (1).]
Figures 5.4a (JBP) and 5.4b (RWS) Peak delay, relative to the onset of the vowel, as a function of rhyme length. "3" = Mamalie Le Mann, "1" = Ma and Mom Le Mann.
rather than stress clashes or word boundaries, the relevant contextual feature is merely the number of following unstressed syllables. This simpler model is not supported by the data, however. Figures 5.6a and 5.6b compare peak alignment in Mama Lemm (•) with that in Ma and Mom Le Mann (x). All of these
[Figures 5.5a and 5.5b omitted: scatterplots of peak delay (seconds) against rhyme length (seconds), with regression lines, comparing Ma and Mom Le Mann (1) with Ma and Mom Lemm (0).]
Figures 5.5a (JBP) and 5.5b (RWS) Peak delay, relative to the onset of the syllable rhyme, as a function of rhyme length. "1" = Ma and Mom Le Mann, "0" = Ma and Mom Lemm.
utterances have only one intervening unstressed syllable, yet for both speakers the peaks are aligned earlier in the latter names, where the syllable bearing the prenuclear accent directly precedes a word boundary. Hence the differences in figures 5.6a and 5.6b are another manifestation of the same effect that was illustrated in figures 5.4a and 5.4b.
[Figures 5.6a and 5.6b omitted: scatterplots of peak delay (seconds) against rhyme length (seconds), with regression lines, comparing Mama Lemm (•) with Ma and Mom Le Mann (x).]
Figures 5.6a (JBP) and 5.6b (RWS) Peak delay in syllables followed by only one unstressed syllable. • = Mama Lemm, x = Ma and Mom Le Mann.
5.4.2 A statistical model of peak alignment
The above figures all show a tendency for peak delay differences between names to be larger at the slower rates, when the rhymes are longer. We have found that it is not the absolute peak delay, but rather the peak placement in proportion to
Table 5.4 Regression coefficients for the proportional placement of F0 peaks, according to equation (3)

                     Prosodic context        Speech rate
Speaker     a         b          c           d          e         R2
JBP       1.223    -0.487     -0.170       0.072     -0.212     64.2%
RWS       1.432    -0.774     -0.191       0.219     -0.038     62.9%
the syllable rhyme length, that exhibits the most regular patterns.8 Our statistical analysis reflects this by modeling peak proportions (peak delay divided by the duration of the associated syllable rhyme).9 With the peak delays expressed as proportions in this way, we had far greater success in predicting the data on the basis of prosodic context and speech rate. The regression equation was: (3)
peak proportion = a + b.wb + c.sc + d.fast + e.slow
where peak proportion is the peak delay divided by the rhyme length
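The following minimal sketch shows how equation (3) turns the coded prosodic and rate variables into a predicted peak proportion; the function itself is my own illustration, the coefficients are those reported for JBP in table 5.4, and the 250 ms rhyme at the end is an invented value.

```python
# Hedged sketch: predicted peak proportion under equation (3), using the
# coefficients reported for speaker JBP in table 5.4.

def peak_proportion(wb, sc, fast, slow,
                    a=1.223, b=-0.487, c=-0.170, d=0.072, e=-0.212):
    """wb, sc, fast, slow are the 0/1 variables of tables 5.1 and 5.2."""
    return a + b * wb + c * sc + d * fast + e * slow

# Mamalie Le Mann (wb=0, sc=0) at the normal rate: peak past the rhyme end (>1).
print(peak_proportion(0, 0, 0, 0))
# Ma Lemm (wb=1, sc=1) at the normal rate: peak well inside the rhyme.
print(peak_proportion(1, 1, 0, 0))

# Converting a proportion back to an absolute delay for an assumed 250 ms rhyme:
print(peak_proportion(1, 1, 0, 0) * 250)   # peak delay in ms
```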
The results are summarized in table 5.4. The R2 values show that for both speakers this model accounts for nearly two-thirds of the variance in the data. It captures the relationships illustrated in the previous figures, and shows (by the signs of the coefficients) that all of the effects work in the same direction for both speakers. The equations mean that in the absence of a word boundary or a stress clash, the peak occurs past the end of the rhyme, with a minor adjustment of the offset dependent on the speech rate. In the presence of a word boundary, the peak is moved earlier by a relatively large fixed proportion of the rhyme (0.487 for JBP, 0.774 for RWS), and it is moved even earlier by a fixed proportion (0.170 for JBP, 0.191 for RWS) in the presence of a stress clash. Both wb and sc are necessary in the model; dropping either of them significantly worsens the fit. The fact that the peaks are past the end of their associated syllable in utterances like Mamalie Le Mann is represented by the coefficient "a" being greater than 1. In fast speech, syllables are shorter and peaks occur proportionately later, so the "d" values are positive. Similarly, slow speech causes peaks to occur earlier in their syllables, so the "e" values are negative.

This model assumes that the effects of prosodic context on the proportional peak alignment are independent of speaking rate. For example, in the case of JBP a word boundary will decrease the peak proportion by nearly half of the rhyme length (0.487) at all rates, regardless of its rate-dependent initial value. We can test this assumption by adding a set of variables to the equations in order to express each aspect of the possible interactions, as shown in table 5.5. When we repeat the analyses with these terms in the equations, we find that they add a very small,
Table 5.5 Variables carrying all aspects of an interaction between prosody and speech rate

Interaction term    Interpretation
wb x fast           Adjustment to the word-boundary effect when speech rate is fast
wb x slow           Adjustment to the word-boundary effect when speech rate is slow
sc x fast           Adjustment to the stress-clash effect when speech rate is fast
sc x slow           Adjustment to the stress-clash effect when speech rate is slow
statistically barely significant amount to the variance already accounted for by the model of equation (3) (for JBP a further 1.6%, for RWS an extra 2.1%10). We further find that unlike the main effects of prosody and speech rate that we showed in table 5.4, the directions of the interactions are not consistent across the two speakers. For example, the largest interaction for JBP (accounting for 1.3% of the variance) was that the word-boundary effect was decreased in magnitude by 0.175 in the slow speaking rate. For RWS, however, in the slow speech rate the word-boundary effect was changed in the opposite direction: it increased by 0.254. Since the interactions were inconsistent across the two speakers and in any case did not clearly reach statistical significance, we believe that the data confirm that the effects of right-hand prosodic context on peak alignment (expressed as a proportion of the syllable rhyme) bear no systematic relationship to speech rate.

To summarize so far, it was the proportional alignment of F0 peaks with their associated syllables, rather than the absolute distance in time, that exhibited rule-governed behavior. The effects of prosodic context and speech rate are consistent across both speakers in the way they influence peak placement, and they operate independently of each other. This simple additive model accounts for nearly two-thirds of the variance in the data. In figures 5.7a and 5.7b the data showing the combined prosodic effects are replotted (from figures 5.3a and 5.3b) along with the values predicted by this model. The R2 values yield an intuitively tractable but stringent evaluation of our model. Statistically, even if we had only been able to account for as little as 8% of the variance in the data our result would still have been highly significant (F(4,175) = 3.44, p < .001). Multiple regression models do not often achieve R2 values as high as those presented here. Nevertheless we may ask why the fit was not even better. Three extraneous sources of variation in the data were: (i) uncertainty in measuring the precise peak locations in the presence of segmentally-induced perturbations, as mentioned earlier; (ii) slight differences between Ma and Mom for RWS (peaks placed in an earlier proportion of the rhyme for Mom),
[Figures 5.7a and 5.7b omitted: scatterplots of peak delay (seconds) against rhyme length (seconds), comparing Mamalie Le Mann (3) with Ma and Mom Lemm (0), with predicted values overlaid.]
Figures 5.7a (JBP) and 5.7b (RWS) Peak delay, relative to the onset of the vowel, as a function of vowel length: observed versus predicted values. "3" = Mamalie Le Mann, "0" = Ma and Mom Lemm. The lines represent the proportions predicted by equation (3) for each rate, multiplied by the corresponding actual rhyme durations.
and (iii) JBP alternated between two different strategies for producing the slow condition: sometimes she would increase the amount of prosodic lengthening while retaining the proportional peak placement, and other times she would increase the syllable durations without changing the peak delay. This latter strategy was particularly evident when there was no adjacent prosodic lengthening trigger, such as in Mamalie Le Mann.

5.4.3 A related model of duration
How do the influences of right-hand context on peak alignment arise ? They seem to be the result of a combination of two different phonetic consequences of the prosodic structure. One of these is that the absolute time, and not just the relative time, between the vowel onset and the F o peak is decreased when a word boundary or stress clash follows. Figures 5.8a and 5.8b show the average peak delay as a function of speech rate for the combined set of Ma Lemm, Mom Lemm, Ma Lemonick and Mom Lemonick (i.e. those utterances with both a word boundary and a stress clash at the right-hand edge of the prenuclear syllable), compared with Mamalie Le Mann (which has neither of these features). In all three of the speech rates, RWS placed his peaks earlier - significantly closer to the vowel onset - in the former set of names. JBP's peaks showed a slightly more complicated pattern: her peaks in the former set of names were earlier than those in the latter set in the fast rate; in the normal rate the mean peak location was also earlier, though this difference did not reach statistical significance, and in the slow rate they were significantly later.11 Overall, the tendency was that for the most part the righthand prosodic context alters the F o trajectories by pushing the F o peaks to the left. The second means by which the right-hand context affects peak placement is indirect, via lengthening of the segmental material. Conspicuously, those environments in which peaks are early correspond to those environments in which prosodic lengthening occurs. This leads us to suspect that peak placement is related to the temporal structure: whatever it is that triggers prosodic lengthening may also affect proportional peak location. Two pieces of evidence lend support to this hypothesis. Firstly, as figures 5.8a and 5.8b indicate, the durations of the syllable rhymes show the same ranking as the peak alignment: earlier peaks tend to occur in longer syllables. Figures 5.9a and 5.9b present this generalization in a different way for three of the names: Ma Lemm (word boundary plus stress clash), Ma Le Mann (word boundary but no stress clash), and Mamalie Le Mann (no word boundary and no stress clash). The horizontal axis is the duration of the word-initial /m/, and the vertical axis is the length of the /a/. 12 What we see is that for both speakers the vowels are longer, relative to their preceding syllable-initial consonants, in precisely the same contexts where peaks are earlier. The scatter in the plots is due to inconsistencies in the length of the initial consonant - speakers 90
[Figures 5.8a and 5.8b omitted: mean peak delay (seconds) plotted against mean rhyme length (seconds) for each speech rate, comparing Mamalie Le Mann with the combined stress-clash names.]
Figures 5.8a (JBP) and 5.8b (RWS) Peak delay versus vowel length, averaged within each speech rate. Lower case letters represent Mamalie Le Mann ("s" = slow, "m" = medium, "f" = fast), upper case letters represent the group of Ma Lemm, Mom Lemm, Ma Lemonick and Mom Lemonick ("S" = slow, "M" = medium, "F" = fast).
[Figures 5.9a and 5.9b omitted: scatterplots of vowel length (seconds) against onset length (seconds), with fitted regression lines for Ma Lemm (solid), Ma Le Mann (dots), and Mamalie Le Mann (dashes).]
Figures 5.9a (JBP) and 5.9b (RWS) Vowel length as a function of the length of the preceding /m/. "0" = Ma Lemm, "1" = Ma Le Mann, "3" = Mamalie Le Mann.
do not seem to exert such fine control here. Nevertheless, as the fitted regression lines show, the influence of right-hand prosodic context on vowel length emerges through the noise and is quite consistent.

A better way to assess this ranking is to apply exactly the same multiple regression model to the rhyme lengths as we did to the peak proportions. This
Table 5.6 Regression coefficients for rhyme lengths in milliseconds

                   Prosodic context      Speech rate
Speaker     a        b        c          d        e         R2
JBP       106      129       51                             83.8%
RWS        81       92       24                             74.8%
allows us to use the data from all twelve names, rather than selecting only three of them, and does away with the need to rely on the noisy data of the word-initial /m/ durations. If peak alignment arises from prosodic lengthening, then the same factors that were shown via equation (3) to account for the earlier placement of F0 peaks in the syllable rhymes should explain as much or even more of the variance in the rhyme lengths. The model is: (4)
rhyme length = a + b.wb + c.sc + d.fast + e.slow
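Before turning to the results, here is a hedged sketch of how the additive model in (4) can be compared with a version augmented by the interaction terms of table 5.5; the observations are fabricated, and the least-squares fit via numpy is my own illustration rather than the authors' procedure.

```python
# Hedged sketch: comparing the additive duration model of equation (4) with a
# model that adds prosody-by-rate interaction terms (table 5.5), using R**2.
# The data below are fabricated for illustration, not the experimental data.
import numpy as np

def r_squared(X, y):
    """Fit y = X @ beta by ordinary least squares and return R**2."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# Columns: wb, sc, fast, slow (invented observations)
prosody_rate = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 0, 1],
                         [1, 0, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1],
                         [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
                         [1, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1]], dtype=float)
rhyme_ms = np.array([290.0, 250.0, 330.0, 240.0, 200.0, 280.0,
                     120.0, 95.0, 160.0, 300.0, 125.0, 290.0])

ones = np.ones((len(rhyme_ms), 1))
X_additive = np.hstack([ones, prosody_rate])                      # equation (4)

wb, sc, fast, slow = prosody_rate.T
interactions = np.column_stack([wb * fast, wb * slow, sc * fast, sc * slow])
X_interact = np.hstack([X_additive, interactions])                # plus table 5.5 terms

print(r_squared(X_additive, rhyme_ms), r_squared(X_interact, rhyme_ms))
```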
The results are summarized in table 5.6. The most relevant information in the table is the prosodic effects and the amount of variance explained.13 Word boundaries and stress clashes lengthen syllable rhymes, by the number of milliseconds in columns "b" and "c" (the reader will recall that wb and sc only have values of 0 or 1, and thereby act as binary switches for the effects whose magnitude is represented by "b" and "c"), across all speech rates. The very high R2 values show that syllable rhyme durations conform to this pattern even more consistently than the peak placement data. Note that according to this model, the amount of lengthening applied to a syllable preceding a word boundary or a stress clash is a constant, in milliseconds, in all three speech rates. In reality, it is more likely that the magnitude of the prosodic effects expressed in such absolute time units would be rate-dependent. In other words, there may be statistically significant interactions between the two factors. Analyses of the rhyme durations with the same interaction terms as in table 5.5 indeed did show such interactions. For both speakers, the effects were larger in slow speech and smaller in fast speech, and this rate-dependence explained significantly more of the variance: 92.4% (as opposed to 83.8%) for JBP and 82.1% (as opposed to 74.8%) for RWS.14

The second piece of evidence that peak placement is related to prosodic lengthening sheds more light on the mechanism by which this lengthening occurs. In figures 5.10a and 5.10b, the duration of the final /m/ in Mom is plotted as a function of the vowel length for Mom Lemm and Mom Le Mann. What we see is that when the rhyme is lengthened by stress clash in Mom Lemm, the last part of
[Figures 5.10a and 5.10b omitted: scatterplots of the duration of the final /m/ (seconds) against vowel length (seconds), comparing Mom Lemm (dotted line) with Mom Le Mann (solid line).]
Figures 5.10a (JBP) and 5.10b (RWS) Duration of the final /m/ in "Mom," as a function of the duration of the vowel. "0" = Mom Lemm, "1" = Mom Le Mann.
the rhyme is lengthened the most. Thus the cloud of points for Mom Lemm is higher than in the case of Mom Le Mann. These then are the two ways that right-hand prosodic context influences peak placement: when a syllable is lengthened by an upcoming word boundary or stress clash, then the peak is moved to the left. At the same time, the syllable is
lengthened in such a way that its right-hand edge is moved to the right, relative to the location of the accent peak. However, this coincidence between durational effects and effects on peak delay is not perfect. As well as being subject to the influence of prosodic triggers located immediately to the right of the syllable bearing the prenuclear accent, RWS's peaks occurred later when more unstressed syllables separated that syllable from the lengthening trigger. To evaluate the consistency of this effect statistically, we can replace the dichotomous sc variable in the regression equations with a quasi-gradient variable, which we call gradsc. This variable ranks the names according to the proximity (in syllables) of the upcoming nuclear pitch accent in the surname. It has a value of 1 for Ma Lemm and Mom Lemm, which constitute the stress clash case expressed by sc, and a value of 0 in Mamalie Le Mann, where the greatest number of syllables intervenes between the two accents. In all other names gradsc has a value of 1/3 or 2/3, according to whether there are two or one intervening syllables, respectively. If the effect of the stress clash on peak alignment is indeed a systematically gradient phenomenon, then replacing sc by gradsc in equation (3) should significantly improve how well the model fits the data. This is precisely the result we obtain: the amount of variance explained increased significantly from 62.9% to 69.3%.15 This effect is not reflected in the durations of the syllables: replacing sc by gradsc in equation (4), which modeled the rhyme lengths, decreased the fit of the model instead of increasing it. Consequently the longer-range influence on the peak proportions seems to have been caused by the peaks being pushed to the left, as if they were somewhat repelled from the upcoming nuclear accent. Unlike the influence of a stress clash on rhyme durations, this tonal repulsion is a gradient phenomenon that extends with decreasing magnitude over a number of intervening unstressed syllables, and it occurs without any concomitant lengthening of the accented syllable.

5.4.4 Summary of the results
Both speech rate and right-hand prosodic context influence F0 peak placement, but they do so in qualitatively different ways. When a syllable is lengthened from being spoken more slowly, the peak will occur correspondingly later. In contrast, when the lengthening is induced by the right-hand prosodic context, the later part of the syllable undergoes disproportionately more lengthening and at the same time the peak will occur earlier in the syllable rhyme. In addition to this length-related effect, for one of the speakers a leftward push on the prenuclear peak is exerted by the upcoming nuclear pitch accent; this extends over several syllables, and has no concomitant influence on duration.
5.5 Discussion

5.5.1 Nuclear versus prenuclear peaks
The data from the present experiment, when compared with those of Steele (1986), establish clear parallels between prenuclear and nuclear H* pitch peaks. Prenuclear peaks, like nuclear peaks, are aligned in proportion to the duration of the associated syllable, rather than at a fixed distance into the vowel. In both positions, speech rate and right-hand prosodic context have different effects on peak location. Also, in both positions peaks are aligned later when there are more unstressed syllables following the syllable bearing the accent.

In addition to establishing these parallels, the current results provide further insight into the relevant contextual features and the mechanism by which peaks are aligned. It is not primarily the presence or number of following unstressed syllables per se, but rather the amount of prosodic lengthening that most directly determines where peaks will be placed. Consequently any contextual feature which induces prosodic lengthening will align peaks earlier in their syllables, be it a stress clash, word boundary, or (as found by Steele) utterance-final lengthening. If, on the other hand, a syllable's duration is varied by other factors, such as changing the speech rate, or substituting a vowel with a different intrinsic duration (Steele, 1986), then the peak delay will shift in such a way that its relationship to the overall syllable duration will be maintained. In section 5.5.2.4, below, we will return to the obvious question of exactly how prosodic lengthening might be unique.

One difference remains between our data on prenuclear H* accents and Steele's data for the same accents in nuclear position; namely that peaks are absolutely earlier if the words that bear them are in nuclear position than if they are prenuclear. However, it may still be possible to explain this difference in the framework of a single set of tonal implementation rules.16 Phrase-final nuclear syllables, which have the earliest peaks by far (compare the "0" points in figure 5.1 with those in figure 5.3b), undergo the greatest amount of prosodic lengthening and so we would expect peaks on these syllables to be correspondingly earlier. In addition, in these cases there is a Low (L) tone immediately following the H*, on the same syllable. Even in Steele's data for non-phrase-final nuclear accents the H* is still followed by a Low. If tonal repulsion is a factor in peak alignment, this would exert an extra leftward push on the H* that would be absent in the case of prenuclear accents.
5.5.2 Mechanisms for similarity

5.5.2.1 INVARIANT RISE-TIME
At first glance one might be tempted to assume that the durational structure alone, in combination with a rate-dependent but otherwise invariant Fo rise time, would account for the data. Such a view would be predicted, for example, within the
framework of the Dutch school of intonation (e.g. 't Hart and Collier, 1975; de Pijper, 1983). According to such a model, the absolute peak delay for all utterances spoken at a particular speech rate is constant; any apparent earlier peak placement wholly arises from prosodically-induced lengthening, and so is an artifact of our analysing peak proportions rather than absolute peak delays.

However, closer inspection of the data shows that this model is insufficient to explain a number of the patterns. Figures 5.8a and 5.8b showed absolute peak delays within each speech rate in lengthening versus nonlengthening environments. For RWS, peak delays were shorter (i.e. closer to the vowel onset) in prosodically lengthened syllables in all three speech rates. For JBP the peak delays were also shorter in the lengthened syllables spoken at the fast rate, but were nearly the same in the normal rate, and longer in the slow rate. These differences within each rate between prosodically lengthened and relatively unlengthened syllables run counter to the prediction of the invariance hypothesis. The figures illustrated how within each utterance type the mean delays for all three speech rates are positioned along lines representing prosodically-determined proportions. The one pattern that did not occur was the very pattern required by an invariant rise-time: constant absolute peak delay for all utterances within each speech rate (i.e. no shift in the vertical axis despite a rightward shift on the horizontal axis).

Another pattern that contradicts the possibility of an invariant rise-time is the one in RWS's data that we described earlier, whereby his peaks were later in absolute time when more syllables separated the accented syllable from the lengthening trigger, while the accented syllable's duration was not correspondingly adjusted. If the rise to the prenuclear peak had an invariant duration, then the number of following unstressed syllables could not exhibit any such relationship with the peak delay in the syllable bearing the prenuclear accent.
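To make the prediction being rejected fully explicit (this is our restatement, not a formula from the original), an invariant rise-time amounts to

peak delay = d(rate), a constant within each speech rate, so that
peak proportion = d(rate) / rhyme duration.

On this scheme, prosodic lengthening could lower the peak proportion only by increasing the rhyme duration, leaving the absolute delay within each rate untouched; the within-rate differences in absolute delay just described are precisely what it cannot generate.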
5.5.2.2 GESTURAL OVERLAP AND TONAL REPULSION
A less simplistic phonetic explanation for the variation in peak alignment is possible if we allow temporal overlap of the underlying gestures for the prenuclear and nuclear pitch accents in those cases where there is insufficient intervening unstressed material separating them. Such an explanation is in the spirit of Bruce's description of Swedish intonation (1977), and is not unlike Browman and Goldstein's approach to coarticulation (this volume). If we consider the present data in this light, we notice that in the nuclear accent (H+L*) Fo steps down onto the nuclear syllable from an immediately-preceding higher level. This movement must begin before that syllable, and so it is possible that it would overlap the prenuclear rise (the movement for the prenuclear H*) when the two accents are juxtaposed. It is difficult to generate testable predictions of this hypothesis without making some extra assumptions about how the laryngeal gestures interact when they
Figure 5.11 Schematic representation of underlying accent gestures for H* and H + L*, and possible resultant Fo contours, in nonoverlapped (a) and overlapped (b and c) cases.
overlap. Two possibilities are illustrated in figure 5.11. The upper diagram (5.11a) schematizes the gestures for the prenuclear and nuclear accents realized on Mamalie Le Mann, with sufficient intervening material to separate them. The lower two diagrams concern the case for Ma Lemm, where the gestures overlap each other. In 5.11b, the Fo contour (the dashed line) follows the minimum of the values specified by the two gestures. In 5.11c the second gesture wins out over the first. In both, the observed Fo peak is displaced to the left of its "underlying" location. These rather simple models are of course not the only possible ways that overlapping underlying accent gestures can be implemented. For example,
Silverman (1987) took a hybrid approach in which a preceding accent (the prenuclear H*, in the current example) will partly override the first, nonaligned tone of a following bitonal accent (in this case the H of the nuclear H+L*) but will itself yield to the aligned tone (in this case the L* of the H+L*) in those (probably few) cases when there is a great deal of overlap. An extra constraint was that the nonaligned tone could be effectively shortened but never completely elided by this process, so that the shortest addressable section of the H immediately adjacent to the L in H+L* would be maintained by the phonetic realization rules.

We have no way to directly estimate the times for the gestures corresponding to the two accents. However, according to the gestural overlap hypothesis, the relation of the underlying melodic gestures to the segmental gestures is not being varied. This permits us to derive the prediction that the Fo peak alignment can be computed as a function of the distance in time between the accented syllables. This prediction is shared by the tonal repulsion hypothesis, although it arises differently there. Tonal repulsion means that the entire gesture for the first accent will be shifted earlier when the next accent follows too closely in time. (There is an implicit functional assumption that the timing of the gesture must be adjusted so that the gesture can be carried out with sufficient completeness.)

The overlap and the repulsion models differ in principle in their predictions about the Fo values at the peak and during the rise. However, this difference is difficult to evaluate in practice, because of our lack of understanding of gestural overlap. In figure 5.11b for example, the Fo peak is lowered by the overlap, but in 5.11c it is not. Accordingly, we confine our attention to the relation between peak delay and temporal separation between the accented syllables, and thereby treat the two theories together.

In the model we have already developed to account for peak alignment, in which we showed that peak proportions behaved in a more regular, rule-governed fashion than absolute peak delays, we accounted for 64%-70% of the variation in the data. The model predicted peak placement on the basis of the prosodic variables wb and gradsc. If early peak placement is solely the result of gestural overlap or tonal repulsion, then the success of the model was entirely due to the tendency for wb and gradsc to carry gross information about the distance between the accents. If this is true, then we should be able to account for an even greater percentage of the variation in the peak alignments (or at worst a comparable amount) on the basis of the actual temporal separation of the accents.

In the absence of the relevant articulatory data concerning the precise starting-points of the prenuclear and nuclear gestures, it seems both reasonable and consistent with the overlap/repulsion hypotheses to assume that the temporal distance between the accents is closely correlated with the distance between the onsets of the accented syllables. Consequently we selected those utterances
Table 5.7 Regression coefficients for proportional placement of Fo peaks in names ending with Lemm and Lemonick, according to equation (5)

Speaker      a        b (wb)    c (gradsc)    d (fast)    e (slow)    R2
JBP        1.293    -0.350      0.043        -0.375      -0.187     61.1%
RWS        1.528    -0.216      0.248        -0.862      -0.016     65.9%
ending in Lemm and Lemonick (i.e. two-thirds of the total corpus), because the onset of the /l/ in these surnames was the start of the nuclear syllable. The interaccent distance was thus estimated by calculating for these utterances the distance between the onset of the vowel bearing the prenuclear accent and the onset of the nuclear /l/.17 We shall call this variable IAdist. The model of peak proportions against which we are comparing the tonal undershoot hypothesis is
(5)  peak proportion = a + b.wb + c.gradsc + d.fast + e.slow
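As an illustration of how a model of this form can be fitted, the sketch below estimates the coefficients of equation (5) by ordinary least squares. This is not the authors' analysis code, and the predictor and peak-proportion values are invented purely to show the mechanics; wb, gradsc, fast and slow are coded as described in the text.

```python
import numpy as np

# Invented illustration data (eight utterances); in the real study each row
# would be one measured token.
wb     = np.array([0, 0, 1, 1, 0, 1, 0, 1], dtype=float)   # word boundary follows the accented syllable
gradsc = np.array([1, 2/3, 1/3, 0, 1, 0, 1/3, 2/3])        # 1 = stress clash ... 0 = Mamalie Le Mann
fast   = np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=float)   # rate dummies (normal rate = baseline)
slow   = np.array([0, 1, 0, 0, 1, 0, 1, 0], dtype=float)
peak_proportion = np.array([0.78, 1.02, 0.95, 1.25, 0.88, 1.30, 1.05, 0.85])

# Equation (5): peak proportion = a + b.wb + c.gradsc + d.fast + e.slow
X = np.column_stack([np.ones_like(wb), wb, gradsc, fast, slow])
coef = np.linalg.lstsq(X, peak_proportion, rcond=None)[0]
a, b, c, d, e = coef

fitted = X @ coef
r2 = 1 - np.sum((peak_proportion - fitted) ** 2) / np.sum((peak_proportion - peak_proportion.mean()) ** 2)
print(f"a={a:.3f}  b={b:.3f}  c={c:.3f}  d={d:.3f}  e={e:.3f}  R2={r2:.1%}")
```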
For comparing this model with one representing the overlap/repulsion hypotheses, we must restrict our analysis to those names for which we also have IAdist values. Table 5.7 gives the coefficients derived from a multiple regression of this model using this subset of the utterances. If the hypotheses are to account for the data, then the absolute peak delays should be a rate-dependent function of the interaccent distance:
(6)  peak delay = a + b.fast.IAdist + c.slow.IAdist + d.fast + e.slow + f.IAdist
This can be rewritten in the form:

peak delay = (b.fast + c.slow + f).IAdist + (d.fast + e.slow + a)
This latter form makes it clearer that peak delay is predicted as a simple linear function of IAdist, where the coefficient and constant vary according to the speech rate. Therefore the results are presented in table 5.8 separately for each rate, in the form:
(7)  peak delay = g.IAdist + h
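The corresponding check of the overlap/repulsion prediction can be sketched the same way: fit the interaction model of equation (6) and read off the rate-specific slope g and intercept h of equation (7). Again, the numbers below are invented for illustration only.

```python
import numpy as np

IAdist = np.array([180., 240., 310., 150., 420., 260., 200., 350.])  # inter-accent distance (ms), invented
fast   = np.array([1., 0., 0., 1., 0., 1., 0., 0.])
slow   = np.array([0., 1., 0., 0., 1., 0., 1., 0.])
peak_delay = np.array([95., 140., 130., 80., 190., 105., 150., 120.])  # ms, invented

# Equation (6): peak delay = a + b.fast.IAdist + c.slow.IAdist + d.fast + e.slow + f.IAdist
X = np.column_stack([np.ones_like(IAdist), fast * IAdist, slow * IAdist, fast, slow, IAdist])
a, b, c, d, e, f = np.linalg.lstsq(X, peak_delay, rcond=None)[0]

# Equation (7): within each rate, peak delay = g.IAdist + h
for rate, (fa, sl) in (("fast", (1, 0)), ("mid", (0, 0)), ("slow", (0, 1))):
    g = b * fa + c * sl + f
    h = a + d * fa + e * sl
    print(f"{rate}: g = {g:.3f}, h = {h:.1f} ms")
```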
These regressions explain a statistically significant proportion of the peak delay variation (for JBP, F(5,114) = 21.301, p < 0.001; for RWS, F(5,111) = 12.476, p < 0.001). But at the same time they are significantly less successful than our own model was in its account of the peak proportions (in a test of the difference between the two models: for JBP, z = 2.352, p < 0.01; for RWS, z = 3.490, p < 0.001).18
Table 5.8 Regression coefficients from analyses using equation (6), expressed in the form of equation (7), for all names except those ending in Le Mann

              fast                mid                 slow
Speaker     g       h          g        h          g        h         R2
JBP       0.116    83.2      0.169     96.3      0.099    152.2     48.3%
RWS       0.210    46.5     -0.003     99.7      0.277     39.7     36.0%
The partial success of this model can be explained by a number of factors. First, IAdist indirectly carries information about prosodic structure, which was more finely encoded in model (3). Second, model (6) permits the absolute peak delay to be greater when the number of syllables between the accent locations is increased. Third, it is able to incorporate the gradient stress clash effect observed in RWS's data.

The results of model (6) do not allow us to conclusively reject gestural overlap and tonal repulsion as factors in explaining peak alignment. Nevertheless, given the significantly lower overall success of model (6), as compared to model (3), it seems unlikely that they provide a complete explanation. Model (3) has fewer parameters, and at the same time is more successful in predicting the observed data, and so in principle we prefer it.
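The comparison of the two fits (see note 18) can be sketched as follows: each model's R2 is shrunk to adjust for its number of predictors, the corresponding multiple correlations are Fisher z'-transformed, and the difference is referred to a normal distribution. This is our reconstruction of the procedure described in the note, not the original analysis code, and because it ignores the dependence between the models it will not reproduce the published z values exactly.

```python
import numpy as np

def shrunken_r2(r2, n, k):
    """Cohen's adjusted ("shrunken") R2 for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def compare_fits(r2_a, k_a, r2_b, k_b, n):
    """z statistic for the difference between two models' multiple correlations,
    treating them as independent (conservative, as note 18 points out)."""
    z_a = np.arctanh(np.sqrt(shrunken_r2(r2_a, n, k_a)))   # Fisher z' transform
    z_b = np.arctanh(np.sqrt(shrunken_r2(r2_b, n, k_b)))
    return (z_a - z_b) / np.sqrt(2.0 / (n - 3))

# e.g. RWS: R2 of 65.9% for model (5) (4 predictors) vs. 36.0% for model (6)
# (5 predictors), with n = 117 utterances in the subset.
print(compare_fits(r2_a=0.659, k_a=4, r2_b=0.360, k_b=5, n=117))
```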
5.5.2.3 EXTRA BEATS
The reader may be tempted to apply the framework of Selkirk (1984) to explain the data we have described. This represents, in our opinion, the most plausible candidate for an explanation of how the contextual effects might be phonologically mediated. Within this framework, phonological phrase boundaries are encoded in the metrical grid using silent timing units; the number of timing units introduced depends on the strength of the boundary involved, and on whether a stress clash needs to be resolved. Units (or at least, a sufficiently large string of such units) may be realized as a pause; typically, however, they borrow segmental material from the phrase-final syllable, and are accordingly produced using syllable lengthening. The effects of prosody on peak proportions might be taken to follow from this description, if we assume that the pitch accent stays with the grid alignment corresponding to its original metrical association, and does not propagate to following silent beats. We do not believe that this explanation is correct. In general, any effects that might be described by such phonological mediation could be equally well described by permitting the phonetic implementation to have direct access to a
hierarchical phonological structure. Encoding the prosodic differences into an intermediate representation with silent beats does not appear to advance the explanation. More specifically, the very same pattern that we have described for RWS, showing a gradient effect on tonal alignment that extends over a number of syllables, requires that extra beats be inserted elsewhere than just at the boundaries. The absence of any related increase in durations presents extra difficulties for such an account. Furthermore, the beats must somehow be increased or decreased in quantity according to the proximity of the upcoming accent, since RWS's peak alignments definitely showed this pattern. The rules for parcelling out the extra beats look quantitative, rather than qualitative. We conclude that the computation of extra beats unnecessarily complicates the explanation of the data: the tonal alignment rules can just as easily refer directly to the underlying metrical structure.
5.5.2.4 SONORITY PROFILE
In the introduction, our fifth conjecture related peak alignment differences to differences in the sonority profile of the syllable. This hypothesis is suggested by the well-known observation that syllable rhymes are more affected by prosodic lengthening than syllable onsets. Figures 5.10a and 5.10b refined this observation by showing a differential effect within the rhyme of Mom. In order for coupling to the sonority profile to explain our data, it must shift the peak leftward when the end of the syllable is lengthened by a factor on the right.

Figure 5.12 illustrates a computation which has this result. The sonority profile of the syllable is schematized as an upslope and a downslope, and the proportional peak placement is computed from the ratio of the two. The top half shows the non-lengthened case, where the Fo rise reaches its peak at the end of the syllable's duration. (Actually, in our data for both speakers the peak was even past the end of the syllable in these cases.) The bottom half of the figure shows the prosodic lengthening as having applied more to the downslope than to the upslope, and the Fo peak occurring proportionately as well as absolutely earlier.

This coupling provides a single mechanism for peak placement in prenuclear and nuclear position: in both cases peaks occur earlier when a lengthening trigger occurs to the right. This mechanism accounts at the same time for the two different effects of right-hand prosodic context that we described in section 5.4.3: the tendency for peaks to move to the left before a lengthening trigger arises because that trigger alters the sonority profile in such a way that the proportional placement of the peak within the syllable is reduced. Of course the bit of algebra in figure 5.12 is not especially plausible as a model of speech production. Much further work would be necessary to formulate this model in a way that is rooted in articulatory coordination, and to evaluate it rigorously.
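One way to cash out "the ratio of the two" in figure 5.12 is sketched below. The scaling constant k and the durations are invented, and the paper does not commit to this particular formula, only to the qualitative pattern it produces: disproportionate lengthening of the downslope pulls the peak proportionally, and here also absolutely, earlier.

```python
def peak_placement(up_ms, down_ms, k=1.5):
    """Place the peak at k times the upslope's share of the sonority profile.
    Returns (proportion of the syllable, absolute delay in ms)."""
    duration = up_ms + down_ms
    proportion = k * up_ms / duration
    return proportion, proportion * duration

# Non-lengthened syllable (e.g. Mamalie Le Mann): peak at the syllable's end.
print(peak_placement(up_ms=100, down_ms=50))    # (1.0, 150.0)

# Prosodic lengthening applied mostly to the downslope (e.g. Ma Lemm):
print(peak_placement(up_ms=95, down_ms=160))    # (~0.56, ~142.5) - earlier both ways
```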
Figure 5.12 Stylized sonority profiles and corresponding Fo trajectories for an H* accent realized on nonlengthened (upper half) and lengthened (lower half) instances of the syllable /ma/.
5.6 Conclusion
The original notions of nuclear accents being different and distinct from prenuclear accents arose partly on descriptive grounds and partly from considerations of intonational function. By allowing a phonology of intonation and a set of phonetic implementation rules, we have an intermediate level of prosodic representation that intervenes between form and function, and at the same time enables simpler and more comprehensive models of surface phonetic realizations.

Our data support parallel phonological and phonetic treatment of nuclear and prenuclear accents, and give further insight into the contextual factors affecting peak alignment in both positions. As we discussed in section 5.1, a remaining difference between the positions (i.e. that H* peaks are aligned earlier if the accents are in nuclear position) might be explained in terms of greater lengthening on nuclear syllables. Alternatively, one might wish to argue that when H* accents occur in nuclear position they are repelled to the left by the closeness of the immediately following L tone. But either way, the question of whether nuclear pitch accents are distinct from prenuclear accents becomes superseded in the light of the current framework. We believe that the question becomes the more general one of what information about an utterance's abstract phonological structure, and on which autosegmental tiers, is accessed by the tonal implementation rules.
Further research will be needed to determine the mechanism by which the alignment patterns arise. Simple models based on conjectures that the Fo rise is invariant or that phonological rules insert extra beats are not supported by the data. There is some evidence for overlap of gestures and/or tonal repulsion, but these alone are inadequate to fully explain the observed alignment patterns. Coupling of the Fo trajectory to the syllable's sonority profile also seems to be involved. We have sampled and tested a number of approaches to describing how words and melody might be coordinated, ranging from articulatory through to more abstract phonological levels of explanation. For each alternative we have attempted to be explicit about how it could generate the surface phonetic structure, and we have tried to derive corresponding quantitative predictions. To the extent that we have succeeded, we have been able to bring experimental methodologies to bear on the interplay of phonological representation and phonetic implementation.
Notes

1 See Pierrehumbert and Beckman (1988) for a more complete exposition of autosegmental association in relation to prosodic structure.
2 Here and subsequently in this chapter we shall refer to Steele's (1986) oral presentation of the data. For a full description and discussion, the reader is referred to Steele and Silverman (in preparation).
3 A nuclear accent is normally defined as the last, and typically most salient, pitch accent in an intonational phrase.
4 Lexical stress on the polysyllabic names was as follows: MAma; MAmalie (similar to the name ROsalie); Le MANN; and LEmonick.
5 For expositional purposes, we will say that a stressed syllable is longer when it clashes with an upcoming stress. Obviously, it would be equally possible to say that it is shorter when it is separated by unstressed material from any upcoming stress. Either way of putting it is tantamount to positing a foot isochrony effect, provided that each foot includes a stressed syllable and all unstressed syllables up to the next stress, without regard to the presence of word boundaries.
6 Cohen and Cohen (1975) is a useful reference for multiple regression, which we relied on extensively.
7 Mathematically, R2 is simply the squared coefficient of the correlation between the predicted and observed values of the dependent variable.
8 Analyses in the form of equation (3) in which the dependent variable was peak delay rather than peak proportion yielded R2 values of 44.5% for JBP and 31.5% for RWS, both being significantly poorer fits than those given in table 5.4.
9 Plots similar to those in figures 5.3 to 5.6 indicated that the data for Mom most resembled the rest of the corpus when the final /m/ was included. Some small differences still remained - particularly for RWS - which are partly responsible for the variance not explained by the model developed below.
10 Significance tests on the semipartial R2s for the set of interaction terms yielded for JBP F(4,171) = 1.9867, p = 0.099; and for RWS F(4,166) = 2.4465, p = 0.048.
11 These differences were measured with t-tests for unrelated samples, rather than with
paired comparisons in an analysis of variance, because the latter method would have pooled the error terms across all cells. This pooling would not have been justified since the peak delay data did not exhibit homogeneity of variance: the greater spread at the slow rate would have swamped out the differences in the other rates. The results for the fast, normal and slow rates, respectively, were: for RWS t13 = -4.673, p < 0.001; t13 = -2.021, p < 0.05; t13 = -1.989, p < 0.05; and for JBP t13 = -1.976, p < 0.05; t13 = -0.677, p = 0.3; t13 = 2.017, p < 0.05.
12 It is well established that prosodic lengthening affects syllable rhymes much more than syllable onsets. This can be seen in the results of, for example, Klatt (1976) and Nakatani and Schaffer (1978).
13 The reader may note that the signs of the coefficients in this table are the opposite of those in equation (3) and table 5.6. This is as it should be, because in this case the figures refer to changes in duration, rather than changes in syllable-relative peak-placement.
14 Significance tests for the amount of variance explained by the set of interaction terms yielded for JBP F(4,171) = 47.8786, p < 0.0001; and for RWS F(4,166) = 16.7692, p < 0.0001.
15 Replacing sc by gradsc also increased how well the model fitted JBP's data, although the effect was not large enough, relative to the noise, to reach statistical significance. A test of the difference between the two dependent semipartial correlation coefficients on the peak proportions after partialling out the variance due to speech rate and word boundaries yielded for RWS: t112 = 6.771, p < 0.0001; and for JBP: t177 = 1.234, p = 0.110.
16 Note, however, that this of itself does not do away with the distinction between nuclear and prenuclear as prosodic categories. We believe that the distinction exists in prosodic organization, but do not find evidence for it in the inventory of English pitch accents or the phonetic rules for pronouncing them.
17 By measuring from the prenuclear vowel onset, rather than from the initial consonant, we were able to exclude from the data the extraneous noise due to variability in the utterance-initial /m/, and yet maintain a measurement point that was temporally associated with the onset of the prenuclear accent gesture. We used the /l/-onset for the nuclear syllable because the vowel onset was not available.
18 The bivariate correlations between the observed values and those predicted by the two models were adjusted according to the sample size and the number of independent variables in each model, according to Cohen's Shrunken R2, and then compared as independent product-moment correlation coefficients using Fisher's z' transformation. Note that the r's were in this case not completely independent, because peak delays are correlated with peak proportions. This makes the test conservative: it is likely to underestimate the difference between the two models.
References

Browman, C. P. and L. Goldstein. (this volume). Tiers in articulatory phonology, with some implications for casual speech.
Bruce, G. 1977. Swedish Word Accents in Sentence Perspective (Travaux de l'Institut de Linguistique de Lund). Lund: CWK Gleerup.
Cohen, J. and P. Cohen. 1975. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum.
't Hart, J. and R. Collier. 1975. Integrating different levels of intonation analysis. Journal of Phonetics 3: 235-255.
Klatt, D. H. 1976. Linguistic uses of segmental duration in English: acoustic and perceptual evidence. Journal of the Acoustical Society of America 59: 1208-1221.
Mattingly, I. G. 1966. Synthesis by rule of prosodic features. Language and Speech 9: 1-13.
Nakatani, L. H. and J. Schaffer. 1978. Hearing words without words: prosodic cues for word perception. Journal of the Acoustical Society of America 63: 234-244.
O'Connor, J. D. and G. R. Arnold. 1973. Intonation of Colloquial English. 2nd edition. London: Longman.
Pierrehumbert, J. B. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT.
Pierrehumbert, J. B. 1981. Synthesizing intonation. Journal of the Acoustical Society of America 70: 985-995.
Pierrehumbert, J. B. and M. E. Beckman. 1988. Japanese Tone Structure. LI Monograph Series. Cambridge, MA: MIT Press.
de Pijper, J. R. 1983. Modelling British English Intonation. Dordrecht: Foris.
Selkirk, E. O. 1984. Phonology and Syntax: The Relation between Sound and Structure. Cambridge, MA: MIT Press.
Silverman, K. E. A. 1987. The structure and processing of fundamental frequency contours. Ph.D. dissertation, Cambridge University.
Steele, S. A. 1986. Nuclear accent Fo peak location: effects of rate, vowel, and number of following syllables. Journal of the Acoustical Society of America, Supplement 1, 80: S51.
Steele, S. A. and K. E. A. Silverman. (in preparation). Alignment of nuclear pitch accents: measurements and a model. MS.
6 Alignment and composition of tonal accents: comments on Silverman and Pierrehumbert's paper
GÖSTA BRUCE
6.1 Introduction
The paper by Kim Silverman and Janet Pierrehumbert specifically addresses the timing of immediately prenuclear high accents in English with respect to the following nuclear accent. A more general topic of the paper is the coordination between rhythmical structure and tonal (or accentual) structure in human spoken language. Another general issue coupled to these topics and approached in the paper is the division of labor between phonology and phonetics.

I will begin my discussion by listing a number of factors that may determine the timing of tonal peaks (or other points in the tonal structure) relative to segmental references as part of the rhythmical make-up of an utterance. The list is not meant to be exhaustive but is intended to show that there are a fair number of factors involved, some of which we know affect accent timing and others that are likely to do so, although experimental evidence is still lacking. There is also probably some overlapping among certain categories in my taxonomy.

1. Tonal composition (phonological analysis of pitch accents - whether analyzed as mono- or bitonal, linked or unlinked tones, targets or gestures - can influence the results of a phonetic analysis).
2. Prosodic context
   a. Boundaries (word, phrase, utterance, etc.)
   b. Rhythmical organization (rhythmical grouping, e.g. stress clash)
   c. Focus (prefocal, focal, postfocal position)
   d. Tonal environment (tonal interaction within and between successive pitch accents, e.g. tonal crowding)
   e. Pitch range (local or global, e.g. differences in degree of overall emphasis due to degree of involvement)
   f. Global intonation (e.g. absence/presence of downdrift due to interrogative/declarative structure)
3. Segmental context (e.g. differences in intrinsic vowel length)
4. Speaking rate (fast, normal, slow tempo)
Out of these possible factors the authors chose to vary systematically speaking rate, word boundary location, presence/absence of stress clash and the number of inter-stress syllables. This choice of variables in the design of the test material made it possible to show for prenuclear position - as has been shown earlier for nuclear position - that lengthening of the stressed vowel induced prosodically by word boundary or stress clash has an effect on tonal peak location different from that when lengthening is brought about by a change in speaking rate. Lengthening a vowel in a slower tempo will delay correspondingly the location of the peak in the vowel, while lengthening a vowel near a boundary or in a stress clash will typically not move the Fo peak to a later location in the vowel.
6.2 Discussion of tonal composition
The main purpose of the experiment reported in the paper was to account for the temporal location of the Fo maximum of the so-called H(igh)* accent in English relative to the stressed syllable. The H* pitch accent in prenuclear position appears typically to be realized as a pitch rise through the stressed vowel to the high turning point (Fo maximum) at the end of or even after the actual stressed syllable. According to the authors' calculations and modeling the default temporal location for the Fo maximum is clearly beyond the rime of the stressed syllable (for both speakers), i.e. in a normal speech tempo and with no word boundary or stress clash. It is only in the prosodically induced lengthening environments that the Fo maximum occurs within the rime of the stressed syllable.

However, considering that the authors admit to even more difficulty in determining the exact location of the high turning points than determining segment boundaries between sonorant segments of the speech material, it seems reasonable to ask if the Fo maximum is the most representative point of the actual tonal gesture. Furthermore, it would have been valuable to have had data on the preceding L(ow) point before the rise (or some other tonal point) in order to be able to assess the temporal variability of the accent gesture. It seems as if the authors have already prejudged the issue by calling the pitch accent H*. It may instead be the timing of the whole pitch accent gesture, as expressed by the timing of both the L and H turning points, that is relevant. Although the H* accent is claimed to be distinct from the L + H* and the L* + H accents in English (Pierrehumbert 1980; Beckman and Pierrehumbert 1986), I assume that it is still typically manifested as a pitch rise in the actual context.

This brings up the general question of how close a relationship we should expect between a phonological description and its phonetic manifestation, and also what language-specific variation in the timing of a pitch accent, e.g. a H* accent, we may assume. As an initial assumption for a pitch accent denoted for example H*, I would expect that the actual pitch accent tone is manifested within the stressed syllable, as implied by the star notation.
6.3 Tonal crowding: evidence from Swedish
To facilitate my discussion of the accent timing issue I will include in my commentary a brief overview of my understanding of Swedish accentuation. In my (1977) analysis of the pitch patterns of the Swedish word accents, where the main variables were focus, utterance position and number of inter-stress syllables, I concluded that the temporally least variable point was the tonal peak (for each of the two word accents) associated with the stressed syllable. This led me to hypothesize that in Swedish - and perhaps in other stress-timed languages as well, where the stressed syllable is dominant and unstressed syllables clearly subordinated in a stress group - alignment of tonal elements to rhythmical elements is critical only at certain points in the rhythmical structure, namely at stress group boundaries (Bruce 1983). Apart from that coupling there is a more or less floating relationship between Fo and segmentals.

As an illustration I will use Swedish material collected in another context and analyzed qualitatively (Bruce 1986). The variables here were for the word in focus: 1) word accent - accent I (H)L* or accent II H*L with the addition of a following focal accent H - and 2) position - final or non-final. A terminal L is added to a word in final position, after the focal accent, if any. In non-final position, the word in focus was followed by a non-focal accent I word (H)L* and there were a variable number of unstressed syllables between the accents (see figure 6.1).

Concentrating on focal accent II (H*L H), we observe that, when the accents are well separated, the focal accent H occurs late in relation to segmental references, in the second syllable after the stressed syllable (figure 6.1, lower right). Adding more unstressed material between the accents has no apparent effect on the location of the H. When the accents come closer, the H now occurs in the final part of the post-stress syllable (figure 6.1, central right), and when the focal accent II occurs in utterance final position, the H occurs early in the post-stress (= final) syllable (figure 6.1, upper right). There is some slight, temporal variation in timing also for the preceding L, but much less, whereas the H* of the word accent has no apparent variation in its timing (see also Bruce 1977). The timing of the focal accent H varies in a parallel way for accent I in focus (figure 6.1, left-hand part). Up to a point, the more material intervening between the two accents, the later the focal accent H appears to occur.

But interestingly enough, when the Fo contours of the H*L H gesture for different environments are compared without segmental references other than being lined up with reference to the beginning of the stressed vowel, the H*L H gesture appears to have a fairly invariant timing pattern (see figure 6.2). Although the H of the focal accent II, admittedly, has a slightly later timing in longer than in shorter right-hand contexts, the whole H*L H gesture is surprisingly constant in its timing. This would not have been expected, if the H (and the preceding L) were critically timed to certain syllables.
Figure 6.1 Accent timing and segmental references in Swedish. Fo tracings of 6-7 repetitions of a disyllabic word in focus (accent I or accent II) in different right-hand contexts. The right-hand contexts were: 1) final position (top) and 2) nonfinal position followed by a nonfocal accent I word: a) with accented syllables separated by just one unstressed syllable (middle) and b) with accented syllables separated by two unstressed syllables (bottom).
The apparently earlier location of the focal accent H relative to segmental references in short, crowded contexts (figure 6.1) is then due mainly to lengthenings - of an utterance-final syllable and of a single unstressed syllable in a stress group - that do not occur in longer, less crowded contexts and are not due to a change in the timing of the tonal gesture.
Figure 6.2 Accent timing without segmental references. Comparison of Fo tracings of a disyllabic word in focus (accent I or accent II) in different right-hand contexts. Two examples.
6.4 Possible models accounting for accent timing
To account for the observed variation in the timing pattern of the Fo maximum of the prenuclear High accent in English the authors present a number of interesting alternatives. The starting point of this discussion is the rather successful modeling of peak proportion, i.e. the timing of the tonal peak in proportion to its syllable rime length. This model also implies that durations are somehow basic and that tonal structure is dependent on temporal structure.

The first alternative discussed is termed Invariance: i.e. for a given speaking rate this would predict an invariant rise time and thus a constancy in the absolute peak delay relative to the vowel onset. Although prosodically induced lengthening may (partly) account for the apparently earlier timing of tonal peaks with reference to syllables even in the Swedish data above, there is still no invariant rise time to be observed. But if we instead interpret invariance as the absence of any apparent temporal reorganization of the whole accent gesture - as for example earlier timing or a steeper rise - with primary variation in segmental durations giving an apparently earlier timing of tonal peaks, then the time course of the accent gesture in Swedish (figure 6.2) appears to be invariant in this weaker sense. There is no apparent temporal reorganization apart from a slight adaptation to longer strings as noted above. Even with this reinterpretation the Invariance model applied to the English data appears to be too simple and cannot be maintained without additional assumptions. It also fails in certain more extreme cases of tonal crowding in Swedish, as discussed in Bruce (1977).
This issue ties in with a debate in the early seventies about the timing patterns of Swedish accents, where the time course of Fo in the stressed vowel was at stake. Evidence was found for both "truncation" (= no temporal reorganization) and temporal reorganization depending on factors such as segmental context, direction of pitch change, dialect and idiolect (Eriksson and Alstermark 1972; Bannert and Bredvad-Jensen 1975).

With respect to the alternative called Phonological Mediation, where prosodically induced lengthening but not speaking rate is assumed to modify the phonological representation (metrical grid), I agree with the authors that the actual problem is likely to have a phonetic rather than a phonological solution. We are dealing with partly gradient phenomena, which would seem to be more easily dealt with in phonetic terms, i.e. real time has to be taken into consideration.

Another possible alternative discussed is the so-called Gestural Overlap (or Undershoot) model. (See Bruce 1977 for an extensive discussion of the phenomenon.) In the Swedish data (figure 6.2) there are clearly cases where no obvious temporal reorganization takes place - i.e. invariance in a weak sense - but where, in a crowded context, an upcoming fall seems to overlap with the preceding rise giving an earlier peak location (and also causing undershooting of the Fo maximum) as compared to less crowded contexts.

A related theory is the one termed Tonal Repulsion. Unlike gestural overlap, tonal repulsion implies a true temporal reorganization of the accent gesture(s), which may result in anticipation of the first gesture and also delay of the second gesture (cf. Bruce 1977). In the Swedish data presented above (figure 6.2) there is no apparent instance of tonal repulsion in the execution of the H*LH accent gesture, while in the English data there are cases with an earlier absolute timing of the H*, which can be readily accounted for in terms of tonal repulsion. This is the case also in some more extreme instances of tonal crowding in Swedish as noted above.

Alternatively one could think that tonal structure is prior to temporal structure and that high tonal demands will give rise to lengthenings of the segments involved in order to accommodate the actual accent gestures. This theory has been advanced in the discussion of accentuation and final lengthening in Swedish (cf. Lyberg 1979; Ohman, Zetterlund, Nordstrand and Engstrand 1979) but has found little support in other investigations (Bannert 1982; Bruce 1983).

In order to test the Gestural Overlap and the related Tonal Repulsion models the authors tried to estimate the distance between the accents. One would have expected the temporal distance between the prenuclear H* and the nuclear L* turning points to be an accurate measure of this distance between accents. They have instead used the duration of the stressed (prenuclear) vowel plus the duration of all intervening unstressed material up to the onset of the nuclear syllable. This is a fairly indirect measure of the inter-accent distance, as the actual tonal points both occur with considerable delay relative to these segmental points (onset
of prenuclear, stressed vowel and onset of nuclear syllable). According to their calculations this model is capable of explaining some of the variation in the timing of the English prenuclear High accent, but it is still clearly less successful than the peak proportion model.

The final alternative considered in the paper relates the timing of the Fo maximum to the sonority of the syllable and is an interpretation of the peak proportion model. The basic idea comes from the observation that prosodic lengthening affects the final, closing part of the syllable more than the initial, opening part, which under additional assumptions given in the paper could account for the observed peak location. This hypothesis suggests that accent timing is critical not only at a certain segmental reference point but also along the greater portion of a syllable rime. Consequently, I think it would also be natural to assume that it is the whole accent gesture rather than just one tonal point such as the Fo maximum that is relevant and has a critical timing. Although a differential treatment of the parts of a syllable is suggestive and may be relevant for Fo timing, the idea that the Fo contour is mapped onto the syllable's sonority profile is still speculative in its present form.
6.5 Conclusion
The approach taken by the authors to the specific question about the timing of prenuclear high accents in English appears to have very general and interesting implications. I think that they have shown convincingly that an account of the coordination between rhythmical structure and tonal structure is better given in phonetic than phonological terms. There also seems to be evidence for the view that temporal structure is in some sense prior to tonal structure. Although we still do not understand in any detail how rhythmical and tonal structures are coordinated in speech, the work by Kim Silverman and Janet Pierrehumbert has pointed to the limitations of some of the more simple accounts and indicated possible ways of increasing our understanding of the problem.
References

Bannert, R. 1982. An F0-dependent model for segment duration? Reports from Uppsala University, Department of Linguistics no. 8, 59-80.
Bannert, R. and A.-C. Bredvad-Jensen. 1975. Temporal organization of Swedish tonal accents: the effect of vowel duration. Working Papers 10, Phonetics Laboratory, Lund University, 1-36.
Beckman, M. and J. Pierrehumbert. 1986. Intonational structure in Japanese and English. Phonology Yearbook 3: 255-309.
Bruce, G. 1977. Swedish Word Accents in Sentence Perspective. Lund: Gleerup.
Bruce, G. 1983. Accentuation and timing in Swedish. Folia Linguistica 17(1-2): 221-238.
Bruce, G. 1986. How floating is focal accent? Paper presented at Nordic Prosody IV, Middelfart, Denmark, June 1986.
Eriksson, Y. and M. Alstermark. 1972. Fundamental frequency correlates of the grave word accent in Swedish: the effect of vowel duration. Speech Transmission Laboratory QPSR 2-3: 53-60.
Lyberg, B. 1979. Final lengthening - partly a consequence of restrictions on the speed of fundamental frequency change? Journal of Phonetics 7: 187-196.
Ohman, S., S. Zetterlund, L. Nordstrand and O. Engstrand. 1979. Predicting segment durations in terms of a gesture theory of speech production. Proceedings of the 9th International Congress of Phonetic Sciences, Volume II. Institute of Phonetics, University of Copenhagen, 305-311.
Pierrehumbert, J. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT.
7 Macro and micro Fo in the synthesis of intonation KLAUS J. KOHLER
7.1 The fields of investigation
Fo time courses can signal stress and intonation, and they do this with a certain degree of variability within the same linguistic pattern. Two complementary questions arise in connection with this variability.

First, when do changes in the Fo contour (e.g. the shift of an Fo peak) across the same segmental string effect changes in linguistic patterning (= macro Fo)? Here three cases have to be distinguished.

(a) In figures 7.1a and 7.1b, the position of the Fo peak in the center of the vowel of the syllable "-lo-" (figure 7.1a), in the syllable "ge-" (figure 7.1b, left peak), or late in the syllable "-lo-" (figure 7.1b, right peak) of the German sentence "Sie hat ja gelogen." (= "She's been lying.") does not change the word and sentence stress, but alters the intonation with corresponding changes of meaning from "established" for the early to "new" for the central to "emphatic" for the late Fo peak. See Kohler (1986b) and Kohler (1987) for experimental data.

(b) In figures 7.2 and 7.3a, the position of the Fo peak in the syllable "-la-" or "um-" of German "umlagern" changes the verbal stress pattern from stem to prefix stress, but represents the same late intonation peak within each word accent.

(c) Figures 7.3a and 7.3b show changes in the stress pattern from prefix to stem combined with changes in intonation from late to early peak.

This paper deals with the perceptual interaction of these stress and intonation functions of macro Fo.

The second question concerns the changes in the Fo contour that have to take place to guarantee the identity of a linguistic stress or intonation pattern across different segmental strings (= micro Fo). Thus, figures 7.4a and 7.4b represent the same stress and intonation patterns (central intonation peak on the last, stressed
Figure 7.1 (a) Speech wave and fundamental frequency (center peak) in the naturally produced utterance "Sie hat ja gelogen." The end contour (on the syllable gen) was added by Fo parameter manipulation because the analysis did not provide it. The time markers A1, A2 delimit the Fo peak contour (coinciding approximately with /o:/) which was shifted left and right in (b). (b) The left- and rightmost positions of the shifted Fo peak contour on the same time scale as in (a), approximating natural productions of early and late peaks (on ge- and -lo-), respectively.
Figure 7.2 Waveform and Fo of the stem-stress utterance of "Er wird's wohl umlagern." with late intonation peak.
syllable) for the German sentences "Sie malt." [zi 'ma:lt] (= "She paints.") and "Sie schickt." [zi 'ʃɪkt] (= "She sends.") in spite of the quite different Fo contours. Here, articulatory constraints cause local Fo adjustments to the same underlying global intonation (macro Fo), i.e. the influence of high vs. low vowels and of voiceless vs. voiced consonants. If these micro Fo differences are ignored in speech synthesis, four consequences may arise: the macro pattern changes, the linguistic identity of segments changes, the overall sound quality of the synthesized utterances changes, or there is no perceivable change at all (Silverman 1987).

The two questions concerning macro and micro Fo are empirical questions which can only be solved in the laboratory, and they have to be addressed simultaneously because the measurable speech production output is the result of both factors, and it must be our aim to abstract the linguistic macro Fo patterns from this output and to provide a set of rules for microprosodic adjustments. Proceeding in this way we not only improve our knowledge of prosodic structures in different languages, but we also lay a firm foundation for the improvement of speech synthesis, e.g. in text-to-speech systems. At the same time, the use of speech synthesis as a research tool allows us to test hypotheses about the contribution of Fo to the macro and the micro patterning.
Figure 7.3 Waveform and Fo of the original prefix-stress with late intonation peak (a) and of the original stem-stress with early intonation peak (b) in "Er wird's wohl umlagern." A, B, C mark the base and peak points of the Fo peak contour for the Fo shifts.
Figure 7.4 Waveform and Fo of "Sie malt." (a) and "Sie schickt." (b) with stress and central Fo peak on the second syllable in each case.
Changes in the terminal falling macro pattern may involve the following parameters:

— the positions of Fo peaks along the time scale in relation to the syllable and segment chains
— the number of Fo peaks
— the absolute and relative heights of Fo peaks
— the shapes of Fo peaks: slow/fast rises/descents
— the precursor before the first peak
— the tail.

The first part of this paper deals with macro Fo, including the variables of Fo peak alignment and of Fo peak shape. The number of Fo peaks is restricted to one and its height kept constant; precursor and tail are dependent on the alignment variable. The two variables are studied with regard to their influence on stress and intonation perception in German.

Changes in the micro Fo may involve the following parameters:

— vowel intrinsic Fo due to vowel height
— contextual Fo due to preceding consonants
— contextual Fo due to following consonants
— Fo masking in voiceless segments.

The second part of this paper looks at the preconsonantal micro Fo with regard to its influence on the perception of the following segment, an effect that has largely been ignored in microprosodic studies. The conclusions reached in this part of the paper are based on data from both German and English.
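For concreteness, the kind of single-peak contour specification implied by this parameter list could be represented roughly as sketched below. This is a hypothetical illustration of a data structure for rule-based synthesis, not the representation actually used in the Kiel system; all field names are ours.

```python
from dataclasses import dataclass

@dataclass
class PeakContourSpec:
    """One terminal-falling macro-Fo pattern with a single peak (hypothetical)."""
    peak_position_ms: float   # alignment of the peak relative to the accented syllable
    peak_height_hz: float     # absolute height of the peak
    rise_ms: float            # duration of the rise (slow vs. fast rise)
    fall_ms: float            # duration of the descent (slow vs. fast fall)
    precursor_hz: float       # level of the stretch before the peak
    tail_hz: float            # level reached at the end of the tail

spec = PeakContourSpec(peak_position_ms=40.0, peak_height_hz=180.0,
                       rise_ms=120.0, fall_ms=200.0,
                       precursor_hz=120.0, tail_hz=90.0)
```

Microprosodic adjustments (intrinsic vowel Fo, consonantal perturbations, masking in voiceless stretches) would then be applied as local modifications to a contour generated from such a specification.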
7.2 Macro Fo

7.2.1 Introduction

In view of the fact that a shift of the Fo peak position from one syllable to another can, but need not, change the stress position in a syllable chain and can also alter the sentence intonation, two questions arise:

(a) Under what conditions is an Fo peak shift (without concomitant changes in sound duration, intensity and other cues) sufficient to shift stress to a different syllable?

(b) How can the stress and intonation functions of Fo peaks be differentiated, and in what ways do they interact?

To provide answers to these questions two experiments were carried out in German, which offers good examples for testing the issues because it has minimal verb pairs, with either prefix or stem stress, which can occur in the same natural
sentence frame, e.g. "Er wird's wohl umlagern." (with stress either on "um-" /'um/ = "verlagern," "He is presumably going to shift it to another place."; or on "-la-" /'la:/ = "belagern," "He is presumably going to besiege it.").
7.2.2 Procedure
Two utterances of this sentence, (a) with stress on "um-" and a late intonation Fo peak on this syllable, and (b) with stress on "-la-" and an early intonation Fo peak, which actually falls on the syllable "um-," were selected for stimulus construction from a large corpus containing several repetitions of all the six combinations of two stress and three intonation positions, spoken by a trained phonetician (the author). The two tokens were analyzed using the same procedure as in Kohler (1986b). Figures 7.3a and 7.3b present the waveforms together with their Fo displays.

The Fo peak positions in the two utterances are practically identical in relation to the syllable structures of "umlagern": they occur at more or less the same time interval just before the beginning of /l/. The differences between the two are in the shapes of the Fo peak contours and in the syllable durations. In the utterance with stem stress in figure 7.3b the post-peak Fo descent is more gradual. Also, the "um-" is much shorter (135 ms in figure 7.3b vs. 222 ms in figure 7.3a) so that, although the initial (rising) portion of the peak (segment AB) is also somewhat shorter, the onset of the rise is much earlier relative to the "um-" syllable, occurring at the beginning of the /l/ in "wohl" rather than at the beginning of the "um-" syllable itself, as in the utterance with prefix stress (figure 7.3a). The "-la-" syllables, on the other hand, have very similar durations in the stem and prefix stress words (268 ms in figure 7.3b vs. 258 ms in figure 7.3a).

Subsequently, the Fo peak contours of the two utterances were exchanged and adjusted to comparable points in the segmental structures. Figures 7.5a and 7.5b show the waveforms of figures 7.3a and 7.3b with the new exchanged Fo contours. Finally, the following Fo parameter manipulations were performed:

(1) In the stimulus of figure 7.3a (original prefix stress), the whole peak contour between the marks A and C was shifted to the right along the time axis in 6 equal steps of 30 ms; the tail of the Fo contour beyond mark C was then time-compressed between the new later time position C and the end of periodicity, and the Fo precursor in "wohl" was time-expanded from its beginning point to the new time position A'. In preparing the remaining stimuli in this series, the left branch of the peak contour (AB) was shifted to the left in 5 equal steps of 30 ms; the right branch of the peak contour was then time-expanded between the new earlier time position B' and the time mark C, and the precursor was time-compressed between its beginning and the new earlier time position A'. When A' fell to the left of the beginning of "wohl" the section of the contour that thus entered the voiceless stretch was masked. Figures 7.6a and 7.6b illustrate the right
Figure 7.5 Waveform of the original prefix-stress (a) and of the original stem-stress (b) utterance of "Er wird's wohl umlagern." in figure 7.3 with exchanged Fo contours, adjusted to the different timing of the new utterance. A, B, C as in figure 7.3.
and left shifts, respectively. In the leftward shifts, the fall was stretched out and thus flattened because that is what was found in the natural productions of the early-peak contours of "Sie hat ja gelogen." (see figure 7.1) and of "Er wird's wohl umlagern." with stem stress (see figure 7.3b). Furthermore, the flattening of the Fo descent improved the quality of the synthesized stimuli because it prevented the Fo tail from becoming too long.

(2) In the stimulus of figure 7.3b (original stem stress), the whole peak contour between the marks A and C was shifted to the left in 8 equal steps of 30 ms; the tail of the Fo contour beyond mark C was then time-expanded between the new earlier time position C and the end of periodicity. As regards the left-branch adjustment the same procedure was followed as in the leftward shifts of (1). In the construction of this stimulus set, the post-peak Fo fall was not stretched out (as it was in (1)), because a pilot test had shown that a further flattening of the already less abrupt Fo fall in the original stem-stress utterance with an early peak does not effect a shift to prefix stress. So, if a shift was to occur at all the Fo descent would have to take place outside the stem syllable.

(3) In the stimulus of figure 7.5a (original prefix stress with transferred Fo peak shape), the same Fo peak shifts were carried out as in (1).

(4) In the stimulus of figure 7.5b (original stem stress with transferred Fo peak shape), the same procedure was followed as in (2).

From these parameter manipulations, there resulted 12 Fo contours each, with peak positions from near the beginning of "um-" to the second half of "-la-," in (1) and (3), and 9 Fo contours each, with peak positions from the beginning of "wohl" to near the end of "um-," in (2) and (4). These Fo contours entered into a stimulus synthesis with the LPC-derived formant and volume values of the original prefix-stress utterance in (1) and (3), and with the corresponding data of the original stem-stress utterance in (2) and (4).

In each case, two test stimulus sets were thus generated, with a slowly and an abruptly falling Fo peak contour, respectively: (2), (3) vs. (1), (4). The slow fall is characteristic of the early-peak intonation illustrated in figure 7.3b. Series (2) is based on the utterance in figure 7.3b, i.e. the stem-stress pattern with early peak, whereas series (3) is based on the utterance in figure 7.5a, which has the prefix-stress segments and duration pattern combined with the slowly falling early-peak shape transferred from the utterance in figure 7.3b. The abrupt fall was characteristic of the late-peak intonation illustrated in figure 7.3a. Series (1) is based on the utterance in figure 7.3a, i.e. the prefix-stress pattern with late peak, whereas series (4) is based on the utterance in figure 7.5b, which has the stem-stress segments and duration pattern combined with the more steeply falling late-peak shape transferred from the utterance in figure 7.3a.
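The shift-and-rewarp manipulation described in (1)-(4) can be sketched roughly as follows. This is our illustration of the general procedure (shift the A-C peak segment, linearly compress or expand the tail and the precursor to fill the remaining intervals); the sampling, contour values and mark positions are invented, and this is not the software actually used at Kiel.

```python
import numpy as np

def rescale(times, values, new_start, new_end):
    """Linearly rescale one contour segment onto the interval [new_start, new_end]."""
    new_times = np.linspace(new_start, new_end, len(times))
    span = times[-1] - times[0]
    frac = (new_times - new_start) / (new_end - new_start) if new_end > new_start else np.zeros_like(new_times)
    # sample the original segment at proportionally equivalent time points
    return new_times, np.interp(times[0] + frac * span, times, values)

def shift_peak(t, f0, A, C, shift_s):
    """Shift the peak contour between marks A and C by shift_s seconds,
    expanding the precursor and compressing the tail (or vice versa)."""
    pre, peak, tail = t <= A, (t > A) & (t <= C), t > C
    t_pre, f_pre = rescale(t[pre], f0[pre], t[0], A + shift_s)
    t_tail, f_tail = rescale(t[tail], f0[tail], C + shift_s, t[-1])
    return (np.concatenate([t_pre, t[peak] + shift_s, t_tail]),
            np.concatenate([f_pre, f0[peak], f_tail]))

# Invented example: one synthetic peak shifted rightwards by a single 30-ms step.
t = np.linspace(0.0, 1.5, 151)                        # time in seconds, 10-ms frames
f0 = 110 + 60 * np.exp(-((t - 0.75) / 0.08) ** 2)     # schematic one-peak contour (Hz)
t_new, f0_new = shift_peak(t, f0, A=0.65, C=0.95, shift_s=0.030)
```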
Figure 7.6a (for legend see figure 7.6 below).
In (1) and (3), the F0 peak positions straddle the syllable structures where a change from prefix to stem stress is to be expected if F0 is a sufficient cue. The two sets differ in that the peak shape of (3), but not of (1), approximates the configuration found in the early peak of the original stem-stress utterance (see figure 7.3b). It is hypothesized, therefore, that if stress is perceptually shifted at all in (1) and (3), there will be a more clear-cut change in (1) because there is a higher probability in (3) that an F0 peak position on "um-" can not only be perceived as a central or late peak with prefix stress but also as an early peak with stem stress. The same would apply to (2) as against (4). To check these hypotheses two test tapes were compiled: (I) containing the 12 stimuli of (1) and the 9 of (4), (II) containing the 12 stimuli of (3) and the 9 of (2). (I) was produced in a short version with 5 repetitions of the 21 stimuli, and in a
Figure 7.6 The original F0 peak position of the prefix-stress utterance with late intonation peak (figure 7.3a) together with the 6 rightward shifts (a) and together with the 5 leftward shifts (b) in the construction of test series (1).
long version with 10 repetitions, with separate randomizations of the 105 and 210 test stimuli, respectively. (II) was only produced in a short version. Each stimulus sentence was preceded by a bleep and followed by a 4-s pause in which subjects were to answer, by ticking the appropriate boxes on prepared response sheets, whether the meaning of the perceived stimulus was "belagern" (stem stress) or "verlagern" (prefix stress). Test (I) was done by 18 subjects in its long version and by 9 in its short one; 4 of the 18 deviated in their responses by judging the 9 stimuli of (4) exclusively as "verlagern." They were, therefore, dealt with separately and not included in figures 7.7 and 7.8. Test (II) was taken by 16 subjects - some of whom had done test (I) - in later sessions. The subjects listened to the test tapes in several subgroups via a loudspeaker in a sound-treated room of the Kiel Phonetics Institute.
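As an aside for readers reconstructing the design, a few lines of code are enough to assemble presentation orders of the kind just described (5 or 10 repetitions of the 21 stimuli of tape (I), independently randomized). The stimulus labels are purely illustrative stand-ins for the actual stimulus files.

```python
import random

# Tape (I) combined the 12 stimuli of series (1) with the 9 of series (4).
stimuli = [f"series1_{i:02d}" for i in range(1, 13)] + \
          [f"series4_{i:02d}" for i in range(1, 10)]

def make_tape(stimuli, repetitions, seed=0):
    order = stimuli * repetitions
    random.Random(seed).shuffle(order)
    return order

short_tape = make_tape(stimuli, 5)    # 105 randomized presentations
long_tape  = make_tape(stimuli, 10)   # 210 randomized presentations
assert len(short_tape) == 105 and len(long_tape) == 210
```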
7.2.3 Results and discussion
Figures 7.7 and 7.8 present the results from these experiments for the 12-stimulus sets with the original prefix-stress duration pattern (series (1), (3)) and for the 9-stimulus sets with the original stem-stress duration pattern (series (2), (4)), respectively. In the shift of the more sharply falling (original) F0 peak contour through the original prefix-stress utterance (see the unbroken line in figure 7.7), there is a clear change from initial to stem stress, in spite of the duration of "um-" pointing to the former. F0 can thus override duration, particularly since the duration of the unstressed "-la-" syllable in the original utterance is very close to its duration under stress. In stimulus 10, which is the first in the ordering from 1 to 12 to yield an unequivocal stem-stress categorization with over 80% positive responses, the F0 peak position is 30 ms into the vowel of the syllable "-la-." This corresponds to the data, presented in Kohler (1986b) and Kohler (1987), concerning the change from an early to a central intonation peak on the stressed syllable. The fact that the change from one stress category to the other is gradual rather than categorical can be related to some interaction of the stress and intonation functions of F0, because the more sharply falling F0 peak assumes positions before the beginning of the syllable nucleus /a:/ of "-la-" which can simultaneously function as the central or late intonation peak in stressed "um-" and as the early intonation peak in stressed "-la-." When the more slowly falling F0 peak is substituted (see the broken line in figure 7.7) the initial-stress category is not clearly represented: the interpretation of an early intonation peak for stem stress is never completely precluded. When an F0 peak contour is shifted through the original stem-stress utterance (see figure 7.8) there is no change between the stress categories: the answers remain predominantly in favor of stem stress. In this case, F0 can thus not override the duration cue completely, because "um-" is too short in relation to "-la-" to signal initial stress. There is some effect of F0 when the more sharply falling F0 peak (continuous line in figure 7.8) occurs actually within the syllable "um-." That is, in stimuli 1 to 5, the F0 peak has been shifted leftward all the way into the preceding syllable "wohl," whereas in 6 to 8 it has been moved only as far back as some point within the prefix syllable "um-," and in these stimuli there are up to 30% judgements of prefix stress. This pattern can be interpreted as meaning that the overriding salience of the duration cue is checked somewhat when the characteristic peak contour occurs in the relevant syllable, allowing the interpretation of the transferred abruptly falling late-peak intonation contour as a central or late peak on "um-." In the other series (broken line in figure 7.8), however, the slowly falling F0 contour reduces the possibility of interpreting the peak as a central or late peak for a prefix-stress pattern, because of the interference of the original early-peak intonation type. These response curves can be contrasted with those of the four subjects who behaved differently and were not included in the figure.
Figure 7.7 Percentage stem-stress responses for "umlagern" (= "belagern," i.e. stem stress) in the series of 12 F0 peak positions (from left to right) combined with the original prefix-stress utterance of "Er wird's wohl umlagern."; original, sharply falling peak contour (continuous line, at each data point N = 14 × 10 + 9 × 5 = 185), and slowly falling peak contour, transferred from the original stem-stress utterance (broken line, at each data point N = 16 × 5 = 80).
Those subjects seemed to have been guided only by the F0 without any reference to the short duration of the unstressed prefix syllable, and therefore perceived prefix stress on all 9 stimuli because of the presence of the F0 peak before the beginning of the "-la-" syllable. The hypotheses that led to the experiments discussed above have thus been confirmed, and the questions asked initially can be answered as follows: (a) An F0 peak shift by itself is sufficient to bring about a clear change from one stress position to another, provided the duration of the stressed-syllable-to-be toward which the F0 peak is shifted is not too short. But even when it is, there is a residual F0 effect. (b) The intonation function of F0 interferes with its stress function if the latter is not supported by duration. This finds its expression in a gradual change from one stress category to another over a stretch of utterance where the positions of a central intonation peak in one stressed syllable and an early intonation peak related
Figure 7.8 Percentage stem-stress responses for "umlagern" (= "belagern," i.e. stem stress) in the series of 9 F0 peak positions (from left to right) combined with the original stem-stress utterance of "Er wird's wohl umlagern."; original, slowly falling peak contour (broken line, at each data point N = 16 × 5 = 80), and sharply falling peak contour, transferred from the original prefix-stress utterance (continuous line, at each data point N = 14 × 10 + 9 × 5 = 185).
to a following stressed syllable can coincide. This interaction is strengthened when the shape of the F0 peak contour approximates the more slowly falling one of the early intonation peak of a later stress.
7.3 Micro F0
7.3.1 Introduction
The importance of F0 after stop release as an acoustic cue for the lenis/fortis categorization of stop consonants has been known for a long time (Hombert, Ohala, and Ewan 1979). F0 preceding the stop closure, on the other hand, has not been attributed a similar cue value. For German it has been demonstrated in a number of experiments with the utterances "Diese Gruppe kann ich nicht leiden/leiten." ("I cannot stand/lead this group.") that in production as well as in perception a level and a level + falling F0 contour on the pre-stop vowel are cues
for /t/ and /d/, respectively (Kohler 1985). These results have been only partially replicated for English in the utterances "I am telling you I said widen/whiten." with very much smaller effects (Kohler 1986a). This difference was related to the fuzziness of the segment boundary in /w/ + /ae/ as against /l/ + /ae/ and to the fact that the long initial formant transitions characteristic of glides have been found to increase the perceived duration of a following vowel. To test this hypothesis, three perception experiments were carried out. In the first one, the previous German experiment was repeated (a) with another German group in order to demonstrate the generalizability of the discovered signal/perception link for German, (b) with a group of British English speakers in order to uncover perceptual differences due to language background. This experiment was also intended to establish a base-line for the other two experiments, which (1) replicated the segmental chain and the F0 patterns of the German test items (/'laedn/ - /'laetn/) in an English sentence frame, and (2) compared its results with those for /'waedn/ - /'waetn/.
7.3.2 Experiment 1
7.3.2.1 PROCEDURE
The test tape of experiment 2 of Kohler (1985) contained a randomization of 10 repetitions of 21 stimuli "Diese Gruppe kann ich nicht leiden/leiten.", being three sets of the same 7 complementary vowel and following silence durations from clear /d/ to clear /t/ in the utterance-final word, combined with three F0 patterns across the stressed vowel /ae/: level + falling, level, and continuously falling. This tape was presented to a new group of 16 native speakers of German (students of phonetics and languages), in several subgroups, via a loudspeaker in a sound-treated room of the Kiel Phonetics Institute. They classified the stimulus utterances as "leiden" or "leiten" sentences by ticking the appropriate boxes on prepared answer sheets. The same test under the same conditions was performed by 13 British English speakers in two subgroups, but they gave their answers by pressing one of two buttons at the recording stations of a reaction-time measurement system. They were students of German spending six months in Kiel to improve their proficiency in the language.
7.3.2.2 RESULTS AND DISCUSSION
The German group replicates the results of the previous test (cf. Kohler 1985: 24ff.) in every respect: figure 7.9 again shows that level F0 introduces a /t/ bias, and level + falling F0 a /d/ bias, compared with the continuously falling pattern. As can be seen in figure 7.10, the English group also shows clearly separate identification functions for level and falling F0. But the English subjects show a higher percentage of /d/ responses than the German subjects in the middle of the
Figure 7.9 Percentage /d/ responses as a function of vowel/(vowel + closure) duration ratio for the 3 F0 conditions in experiment 1 ("leiden/leiten," German group), and binomial confidence ranges at the 5% level; 16 listeners. At each data point N = 160.
duration ratio range for both level and continuously falling F0, and the response curves for falling and level + falling F0, which are already close together in the data of the German group, coalesce in the English group in this upward shift of two of the identification functions. This means that the English subjects show the same perceptual effects with regard to level F0 as against the other two F0 patterns, but
Figure 7.10 Responses of the combined British English group in experiment 1. At each data point N = 130.
that they nevertheless locate the duration ratio boundary at a lower value than the German listeners. The reason for this difference may be that because English speakers generally devoice the nasal plosion after fortis stops, the absence of this feature in the German test stimuli biases English listeners towards /d/ in the middle of the duration ratio range.
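The binomial confidence ranges plotted in figure 7.9 can be approximated as in the following sketch; it uses the normal approximation at the 5% level, which may differ slightly from whatever exact method was used for the published figure, and the response count is invented.

```python
import math

def binomial_ci(k, n, z=1.96):
    """Normal-approximation 95% confidence range for a response proportion."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# e.g. a hypothetical 120 /d/ responses out of N = 160 at one duration-ratio step
low, high = binomial_ci(120, 160)
print(f"p = {120/160:.2f}, 95% range: {low:.2f}-{high:.2f}")
```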
7.3.3 Experiment 2
7.3.3.1 PROCEDURE
Two English sentences were constructed that replicate the focal and utterance-final position as well as the segmental structure and the phonetic context of the German test words in experiment 1. The two family names "Lyden" and "Lighton," which are of equal (low) frequency in Britain, were inserted in the sentence frame "I think you'd have to ask..." They contain the same phoneme sequences as the German words and can also be realised with nasal plosion. They, too, occur after a voiceless consonant cluster that interrupts the F0 glide from a low value on "ask" to a high one in the contrastively stressed name, so that F0 has practically reached its peak value when it sets in again at voiced /l/ onset. These sentences were pronounced several times by a native speaker of Southern British English, with focus stress on the name, elicited by the context "Who do you think would know about this, Lyden or Lighton?" The F0 contours across the names were very similar to those found in the German sentences of experiment 1 (cf. Kohler 1985: 24): before the lenis stop F0 drops much further in the stressed vowel than before the fortis. One token of a "Lyden" sentence was selected for the test stimulus generation, which followed the principles laid down in Kohler (1985). The stressed vowel measured 289 ms, the following stop closure duration 46 ms, and the stop release and aspiration 24 ms. Three F0 patterns were generated across the stressed vowel: (a) level + falling (122-120-75 Hz) with the fall beginning at the vowel center, (b) level (122-120 Hz), (c) linearly falling throughout (122-75 Hz). These F0 contours were combined with 7 rate-manipulated vowel durations, from 260 ms down to 200 ms in 10-ms steps. The closure voicing and release were excised and replaced by silence, which was increased from 70 ms up to 160 ms in 6 equal steps, complementary to the vowel shortening. The 21 vowels (3 F0 contours × 7 durations) produced in this manner, together with the complementary closure pauses, were spliced into the carrier utterance. Thus the durations and F0 patterns of the resulting 21 "Lyden/Lighton" stimuli were fully comparable to those generated in the German test, the only differences being that after the silence F0 set in at 70 Hz (instead of 66 Hz) and that the periodicity of the nasal was more regular and of much greater amplitude than in the German "leiden/leiten" stimuli, i.e. there was proper and strong voicing instead of creak. Since the frame was not synthesized, the stimuli sounded completely natural, and no "synthetic" quality was detectable in the synthesized vowel sections either. The 21 stimuli were copied 10 times and randomized to give a test of 210 stimuli, following the same procedure as in the German test. The 13 native British English speakers of experiment 1 acted as informants under the same listening conditions in two subgroups. They classified the stimulus utterances as "Lyden" or "Lighton."
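The factorial design just described (3 F0 patterns crossed with 7 complementary vowel and closure durations) can be made explicit with a few lines of code; the step sizes follow the text, everything else is illustrative. The vowel/(vowel + closure) ratio computed here is the quantity plotted on the abscissa of figures 7.9-7.12.

```python
f0_patterns = ["level+falling", "level", "falling"]
vowel_ms   = [260 - 10 * i for i in range(7)]   # 260, 250, ..., 200 ms
closure_ms = [70 + 15 * i for i in range(7)]    # 70, 85, ..., 160 ms (6 equal steps)

stimuli = [
    {"f0": p, "vowel_ms": v, "closure_ms": c, "ratio": round(v / (v + c), 3)}
    for p in f0_patterns
    for v, c in zip(vowel_ms, closure_ms)
]

assert len(stimuli) == 21
print(stimuli[0])    # longest vowel:  ratio 260/330 = 0.788
print(stimuli[6])    # shortest vowel: ratio 200/360 = 0.556
```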
Macro and micro Fo in the synthesis of intonation
7.3.3.2 RESULTS AND DISCUSSION
The results in figure 7.11 are basically congruent with the English group results of experiment 1: the identification curves occupy more or less the same positions along the duration ratio axis, the functions for the two falling F0 sets are again not differentiated from each other, but are clearly separate from the function for level F0, which yields significantly more /t/ responses. The difference between the two experiments lies in somewhat more /d/ judgements in the lower half of the duration ratio scale for experiment 2. So there must be some essential acoustic difference between the English "Lyden/Lighton" and the German "leiden/leiten" stimuli. The obvious candidate is the strong voicing instead of creak in the final nasal of the English utterances. It provides a more prominent release cue for /d/, which may enter into conflict with the fortis cues and weaken their effects, i.e. the effect of flat F0 generally and the effect of duration in the shorter range. It has also been shown by Kohler and van Dommelen (1987) that different voice qualities affect the perception of lenis and fortis consonants.
7.3.4 Experiment 3
7.3.4.1 PROCEDURE
The sentences "I am telling you I said widen/whiten." were pronounced several times with focus stress on the final word and with nasal plosion by the same native Southern British English speaker that produced the utterances for experiment 2. One "widen" token was selected for constructing 21 test stimuli according to the same principles as in experiments 1 and 2. The vowel durations ranged from 265 ms to 205 ms, the silence durations from 70 to 160 ms. Again 3 F0 patterns were generated with each vowel duration. In the level + falling F0 pattern the level section was represented by the naturally produced fluctuation between 119 and 123 Hz over the first 100 ms of the original vowel, followed by a linear fall to 85 Hz, the proportion of level and slope sections staying the same in all 7 stimuli. The first 100 ms of the level F0 were identical with the level section of the level + falling pattern in the longest vowel and changed proportionally with the vowel duration; the remainder descended to 122 Hz. In the third pattern, F0 fell linearly throughout from 119 to 85 Hz. The original /d/ release was again eliminated, and the 21 synthesized vowels + closure pauses were spliced into the sentence frame. F0 at voice onset of the final nasal was 89 Hz, descending to 69 Hz. The very large amplitude of the regular periodicity in /n/ was adjusted to the one found in "Lyden" by applying the reduction factor 0.35. The durations and the F0 patterns were comparable to the ones in the test stimuli of experiments 1 and 2, but with important differences in the height of the pre- and postconsonantal F0 ending and starting points. The test tape construction and the running of the experiment followed the same
Figure 7.11 Responses of the combined British English group in experiment 2 ("Lyden/Lighton"). At each data point N = 130.
lines as in experiment 2. A previous run of the test was reported in Kohler (1986a). It was repeated here by the same two British English subgroups as in experiments 1 and 2. In a pretest, each of the 13 subjects was examined as to whether they distinguished "wh" from "w". Two informants did and were, therefore, excluded from the test because their expectations for "whiten" would have been different.
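The three F0 patterns described in the procedure above can be sketched roughly as follows. The naturally produced 119-123 Hz fluctuation of the level section is simplified here to a constant 123 Hz plateau, and the 5-ms sampling step is an arbitrary choice; this is an illustration of the pattern shapes, not the original synthesis routine.

```python
import numpy as np

def widen_whiten_f0(vowel_ms, level_ms_longest=100.0, longest_vowel_ms=265.0):
    """Rough sketch of the three pre-stop F0 patterns for one vowel duration."""
    t = np.arange(0.0, vowel_ms, 5.0)
    level_ms = level_ms_longest * vowel_ms / longest_vowel_ms   # scales with duration
    frac = np.clip((t - level_ms) / (vowel_ms - level_ms), 0.0, 1.0)
    plateau = 123.0                                    # stands in for the 119-123 Hz band
    level_falling = plateau + frac * (85.0 - plateau)  # plateau, then linear fall to 85 Hz
    level         = plateau + frac * (122.0 - plateau) # plateau, then slight descent to 122 Hz
    falling       = 119.0 + (t / vowel_ms) * (85.0 - 119.0)   # linear fall throughout
    return level_falling, level, falling

lf, lv, fa = widen_whiten_f0(265.0)   # the longest of the 7 vowel durations
```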
Figure 7.12 Responses of the combined British English group in experiment 3 ("widen/whiten"). At each data point N = 110.
7.3.4.2 RESULTS AND DISCUSSION
Figure 7.12 provides the data for the combined group. There are no inter-group divergences: the differences between the three F0 patterns have practically disappeared. The effect of flat F0, which was still slightly present in the previous run of the same test, has been leveled out. Otherwise the two test runs provide corresponding locations of the identification functions. Since it is only the
response curve for flat F0 that is positioned differently in the "Lyden/Lighton" and the "widen/whiten" data, the initial consonant /w/ cannot be responsible for the increase of /d/ judgements. It must be an acoustic feature difference that is peculiar to the flat F0 stimuli. In "Lyden/Lighton," F0 is flat across the stressed syllable, and a rise from the preceding syllable is masked by voicelessness; after the closure silence, F0 resumes at its low utterance-final value. The flat F0 contour is thus bounded by voiceless stretches on both sides, with low F0 preceding and following. In this environment, the high flat F0, i.e. the fortis cue, becomes perceptually salient. In "widen/whiten," on the other hand, there is an upward F0 glide from the low value of the preceding syllable right into the stressed vowel, and it is only the final 130-160 ms that are actually flat. After the closure pause, there is a substantial F0 fall of 20 Hz. In this context, the high flat F0 is integrated into a macroprosodic rise-fall pattern and is, therefore, perceptually far less salient, thus losing its fortis cue strength.
7.3.5 General discussion
The results of the three experiments point to the following prosodic influences on lenis/fortis stop perception in German and English.
1. A flat F0 across a stressed pre-stop vowel in a focused utterance-final disyllable is a fortis cue, compared with falling F0 patterns, in both German and English, as long as the flat F0 is clearly detachable from a macroprosodic utterance intonation as a microprosodic manifestation. In German, a flat + falling F0 is also differentiated from a continuously falling F0 as a stronger lenis cue.
2. In English, the category boundary between lenis and fortis is located at lower duration ratios. This leads to a coalescence of the identification functions for flat + falling and continuously falling F0 contours.
3. A stop release into regular voicing of high amplitude and an F0 fall (below the focus peak) weakens the preconsonantal microprosodic fortis cue.
4. The microprosodic effects of pre-stop flat and flat + falling F0 are obliterated when they are integrated into macroprosodic utterance pitch patterns.
7.4 Conclusions
As regards the global utterance F0 of such languages as German and English, it is necessary to distinguish between an intonation and a stress function at the macro F0 level. In German, the shift of an F0 peak contour, over a total time stretch of 300-400 ms, from the center of a syllable nucleus to its right-hand boundary or to a syllable preceding it, which is unequivocally signaled as unstressed by its segmental qualities and quantities, does not change the stress position but cues different intonations related to the semantic features "established" vs. "new" vs. "emphatic." On the other hand, an F0 peak shift by itself is also sufficient to bring
about a clear change from one stress position to another, provided the duration of the stressed-syllable-to-be is not too short. The shape of the F0 peak, in addition to its temporal location in the syllable structure, is a further cue for the signaling of the stressed syllable. This stress function of F0 shows an interference from its intonation function (i.e. the early, central or late peak in relation to the same stress position) if it is not supported by duration. Similar findings regarding the stress and intonation functions of F0, as established for German, are to be expected for English and other languages that differentiate stressed from unstressed syllables and associate meaning-related utterance intonations with stressed syllables. I would even hypothesize that the use of early vs. later F0 peaks to signal the "established" vs. "new" dichotomy is quite widespread. If, however, the contrast of temporal peak alignment has already been assigned a function at some other linguistic level, as in the Scandinavian word accents (cf. Garding 1982), its intonational use for the utterance-semantic categories is either precluded or has to undergo some language-specific modification. It is an interesting empirical question to study the phonetic realization of the intonational categories described for German in a language such as Swedish. Beside these language-specific macroprosodies of stress and intonation, there are the universal microprosodic F0 adjustments due to the articulatory constraints imposed by segments at each point in the utterance. Thus, central peak contours in stressed vowels of German and English show perceptually relevant microprosodic influences from following as well as preceding fortis or lenis consonants, respectively. In both cases, F0 is raised by fortis consonants. But the perceptual effects of a pre-stop F0 raising only show up when the pattern is clearly detachable from, i.e. contrasts with, a macroprosodic utterance intonation as a microprosodic manifestation. Although microprosodic influences arise both from prevocalic and postvocalic consonants and can become perceptual cues to segmental identity, their effects are more easily overridden by the utterance prosody in the preconsonantal position. When the segmental cues to the lenis/fortis distinction (voicing, aspiration, duration) disappear through sound change the microprosodic differences may be preserved or even heightened and thus lead to tonal distinctions. But as the preconsonantal F0 effect is better controlled by the global F0 than the postconsonantal one, such tonogenesis is hardly attested in this context.
References
Garding, E. 1982. Swedish prosody. Summary of a project. Phonetica 39: 288-301.
Hombert, J. M., J. J. Ohala and W. G. Ewan. 1979. Phonetic explanations for the development of tones. Language 55: 37-58.
Kohler, K. J. 1985. F0 in the perception of lenis and fortis plosives. Journal of the Acoustical Society of America 78: 21-32.
1986a. Preplosive F0 in the perception of /d/-/t/ in English. Proceedings of the Montreal Symposium on Speech Recognition, McGill University, Montreal, Canada: 34-35.
1986b. Computer synthesis of intonation. Proceedings of the 12th International Congress on Acoustics, vol. 1, Toronto, Canada: A6-6.
1987. Categorical pitch perception. Proceedings of the 11th International Congress of Phonetic Sciences, vol. 5, Tallinn, 331-333.
Kohler, K. J. and W. A. van Dommelen. 1987. The effects of voice quality on the perception of lenis/fortis stops. Journal of Phonetics 15: 365-381.
Silverman, K. 1987. The structure and processing of fundamental frequency contours. Ph.D. thesis, Cambridge University.
8 The separation of prosodies: comments on Kohler's paper
KIM E. A. SILVERMAN
Kohler's experiments concern the mapping between the acoustic/phonetic detail of spoken utterances and their underlying phonological structure. A core problem in describing spoken language is that some patterns that are almost identical are perceived as being quite different, while at the same time other patterns that differ immensely are nevertheless classified by listeners as being the same. Kohler addresses two specific manifestations of this problem in the relationship between F0 contours and the utterances that carry them.
8.1 Macro F0
His first experiment utilizes the minimal stress pair UMlagern and umLAGern (meaning roughly "to relocate" and "to surround," respectively). When an F0 peak occurs between the first and second syllables - on the boundary between the /m/ and the /l/ - then this can indicate either a "late peak" associated with UMlagern or an "early peak" associated with umLAGern. Hence two different underlying intonational categories, when combined with two different lexical stress locations, will result in apparently the same F0 pattern. Yet listeners are able to recognize the intended word. Kohler explores the question of how listeners recover the underlying forms (specifically where the primary lexical stress falls) from the combination of F0 with other characteristics of the acoustic signal; he uses linear predictive synthesis technology to produce versions of the utterances in which the F0 peak occurs earlier and later than the syllable boundary. The main result seems to me to challenge the widespread assumption that perception of F0 proceeds in parallel with and more or less independently of perception of the rest of the speech signal. Specifically, Kohler demonstrates one way in which the speech perception strategy makes use of other aspects of the signal in combination with F0: even when the peak is located early in the /um/, that syllable will not be classified as stressed unless its duration, amplitude, voice quality and spectral characteristics (which are all jointly carried by the linear predictive coefficients) are
consistent with that. On the other hand /lag/ will not be classified as stressed, even if all of these characteristics are present, unless the F0 level in it also is sufficiently high. This result indicates that F0 contours are not particularly meaningful when considered in isolation; it is only in combination with other characteristics of the linguistic structure that their phonological composition can be unambiguously identified. A given intonational structure will affect more than just F0, and an F0 contour by itself (considered without reference to the segmental structure which carries it) is not always sufficient to completely specify its tonal makeup. This is not a new message, of course. But the evidence for it has largely been either intuitive or anecdotal. For example, during a research project concerned with pragmatic interpretation of intonation contours (Scherer, Ladd and Silverman 1984) I heavily low-pass filtered a set of utterances excised from a corpus of interviews (I used a cut-off frequency of about 150 Hertz with a 60 decibel/octave rolloff) in order to render them unintelligible. In certain cases the filtered versions clearly sounded as if they had a different prosodic structure from their original unfiltered counterparts, despite the fact that their F0, duration, and gross amplitude variation remained intact. This phenomenon was largely due to my having removed many cues concerning how the F0 contour was aligned with the segmental structure. A related manifestation of this interdependence of F0 and the rest of an utterance is often encountered when we attempt to alter the intonation in a digitized utterance by manipulating the F0 contour prior to linear predictive resynthesis. Almost any new intonation contour can be imposed, but one change that is often unsuccessful is when we attempt to move the intonational nucleus a significant distance earlier in the utterance (such as trying to turn They wanted to write about TIMING into They WANTED to write about timing). It is common for the formerly-nuclear word (in this case timing) to persist in sounding somewhat accented or unnaturally prominent, because of its duration, amplitude, vowel quality, and voice quality. However, although the nonautonomous nature of F0 contours is not a new discovery, it is one that needs to be reiterated for both its applied and its theoretical implications. In the area of automatic speech recognition, for example, it means that the common approach - in which the to-be-recognized utterance is stretched or contracted, segment by segment, until it achieves a satisfactory spectral match with a previously-stored abstract pattern (Dynamic Time Warping and most current applications of Hidden Markov Modeling) - is doomed from the outset to fail in many contexts because it effectively discards the information carried by the durational structure and F0. It similarly means that automatic recognition of intonation will need to be based on more information than just F0. Conversely, in the area of speech synthesis, it means that if a particular intonation
contour or accent location is desired on an utterance then it will not be sufficient to change only the F0 contour without also introducing concomitant variation in the other parameters. This latter example brings us to the theoretical questions of what belongs in the phonological representation and how it is used in the phonetic implementation process. Kohler discusses the result as an interaction between the stress function and intonation function of F0. An alternative view is that F0 does not directly have a stress function.1 In this view, intonation is a sequence of phonological units (including pitch accents and boundary phenomena) that jointly make up the melody of an utterance. Stress, by contrast, is an abstract property of particular syllables which specifies, amongst other things, how intonation can be aligned with a text - namely, pitch accents can only be aligned with stressed syllables. Phonetically, stress pervades all acoustic parameters (or at least all that have so far been measured). Thus a stressed syllable will have more extreme formant values, greater duration, a steeper closing phase of the glottal waveform which results in greater amplitude and more high-frequency energy in the spectrum, and so on. Listeners in some sense "know" the phonological inventory of possible pitch accents, as well as the structural constraint that a pitch accent cannot be linked to an unstressed syllable, and on the basis of this knowledge they are unlikely to judge an originally unstressed /um/ as stressed even when it carries an F0 peak (unless it also bears a sufficient number of the other correlates of stress). This view seems to run into difficulty with Kohler's result concerning the syllable /lag/. When the utterance was spoken with the lexical stress on /um/ (and hence /lag/ was not accented) but the F0 peak was synthetically moved 120 ms to the right (stimulus number 10 in Kohler's figure 7.6), well into the /lag/, most of the listeners did indeed judge the /lag/ as carrying primary stress. So here F0 did seem to have a stress function, despite the contradictory LPC parameters. Kohler attributes this to the observation that the originally unaccented /lag/ was almost as long as the stressed one, and so its duration was not inconsistent with its bearing a pitch accent. No doubt this durational similarity is at least largely responsible for the listeners' judgments. But why should there be such a durational similarity in the case of accented versus unaccented /lag/, while at the same time there is no corresponding similarity in accented versus unaccented /um/? I would like to illustrate two different possible classes of explanation: one that attributes it to a posited similarity in the underlying phonological representations, and one that sees it as a result of the processes by which two supposedly different phonological representations are given their phonetic realization. An example of the first type of explanation would be that the stems of German prepositional verbs are always marked as stressed in their underlying lexical forms. The prepositional prefix, on the other hand, is only stressed in those words where
the preposition bears a pitch accent. According to this account, part of the underlying representation of these words would thus be:

    pitch accent:         x                        x
    stress:               x                        x     x
    syllable:        um  LAG  ern             UM  lag  ern
                      (umLAGern)                (UMlagern)
Thus in UMlagern there are really two stresses: the /um/ is stressed and accented (and hence is the most prominent syllable), but at the same time although /lag/ is not accented it is still underlyingly stressed. Therefore it is long. Note that this assumes that it is stress, not accent, which determines a syllable's duration. In combination with this assumption, this explanation predicts that (in the case of UMlagern/umLAGern) accented and unaccented /lag/ should have similar durations in all contexts (because they are both stressed). An explanation of the alternative type might be that since in Kohler's experiment the verb occurred at the end of the sentence, it was subject to final lengthening (there is ample evidence in the literature that final lengthening can apply to more than just the very last syllable in an utterance), and that during the phonetic implementation process this extra lengthening obscured or otherwise overrode any durational difference between accented and unaccented /lag/. This explanation assumes that unaccented /lag/ is not underlyingly stressed, and consequently would predict that in sentences where more material intervenes between the verb and the end of the utterance, such as Der hätte es aber ständig [UMlagern/umLAGern] können, a durational difference should emerge more clearly. According to this explanation, the underlying representations would be

    pitch accent:         x                        x
    stress:               x                        x
    syllable:        um  LAG  ern             UM  lag  ern
                      (umLAGern)                (UMlagern)
The assumption here is that mapping from stress features in the underlying representation to the durational characteristics of the acoustic speech signal is context-dependent, whereby such contextual factors as final lengthening may obliterate any length differences that were related to the presence or absence of stress on a syllable. The sentence Der hätte es aber ständig [UMlagern/umLAGern] können is an instance of a context where the above two explanations differ in their predictions concerning whether or not durational differences should emerge. In this sentence, the /lag/ is separated from the end of the utterance by three syllables, rather than only one. According to the former explanation, the /lag/ in ...UMlagern können
and the /lag/ in ...umLAGern können should have similar durations because /lag/ is always underlyingly stressed. The latter explanation, which attributed the durational similarity in Kohler's productions to the interference of final lengthening, predicts that the /lag/ in ...UMlagern können should be shorter than the /lag/ in ...umLAGern können, because the end of the utterance is further removed. As an initial exploration, I recorded two adult native speakers of German speaking six repetitions of each of the above two sentences, and also six repetitions of each of Kohler's original two sentences (Er wird's wohl [UMlagern/umLAGern]). I measured the durations of the /um/ and /lag/2 syllables by means of synchronized graphics displays of the digitized waveforms and computer-generated spectrograms, combined with listening to portions of the speech signal. The results (averaged in milliseconds) for each speaker are in table 8.1. The top row of each table represents the durations of /um/ and of /lag/ in these speakers' renditions of the same sentences as used by Kohler. If we compare these durations with those for /um/ and /lag/ in the ...umlagern können sentences, then we can see that there was indeed an influence of final lengthening, for the /um/ as well as for the /lag/ (shown by the differences on the bottom line of each table). So in one respect, the second explanation was correct: the duration of /lag/ in Er wird's wohl umlagern is influenced by the word being utterance-final. In another respect, also, a prediction of the second explanation was borne out: both speakers did show a durational difference between accented and unaccented /lag/ in the sentence where the effect of final lengthening was reduced (the ...umLAGern können context). But the most important prediction of the second explanation proved incorrect: utterance-final lengthening did not obscure the durational difference between accented and unaccented /lag/ at all. The figures in the extreme right-hand columns of each of tables 8.1a and 8.1b show that accented /lag/ was longer than unaccented /lag/ for both speakers in both sentence contexts. Even in the sentences that matched the ones Kohler used, where umlagern was utterance-final, /lag/ was considerably longer when it was accented. Final lengthening did not seem to obliterate anything. The first explanation did not fare much better, though. It assumed that /lag/ is always underlyingly stressed, and further assumed that this would consequently result in unaccented /lag/ having similar durations in both sentence contexts. Clearly this did not turn out to be true for these speakers. One possible explanation for the pervasive durational difference is that in German, the presence of a pitch accent adds somewhere between 20 and 40 ms to a syllable's duration, on top of the contributions of other prosodic factors. The pattern in the data for the syllable /um/ is quite puzzling. It also showed a durational difference in both sentence contexts, but much smaller in magnitude than the 87 millisecond difference in Kohler's own renditions of each of the two utterances. Moreover, for speaker BC the accented versions of the syllable were shorter than the unaccented versions, as shown by the two negative numbers in
Table 8.1a Mean syllable durations in milliseconds; speaker JS

Mean durations of /um/
                                           UMlagern   umLAGern   difference (UM-um)
Utterance-final                               165        127            38
Nonfinal                                      148        118            30
Difference (due to final lengthening)          17          9

Mean durations of /lag/
                                           UMlagern   umLAGern   difference (LAG-lag)
Utterance-final                               238        259            21
Nonfinal                                      210        221            11
Difference (due to final lengthening)          28         38

Table 8.1b Mean syllable durations in milliseconds; speaker BC

Mean durations of /um/
                                           UMlagern   umLAGern   difference (UM-um)
Utterance-final                               162        174           -12
Nonfinal                                      146        163           -17
Difference (due to final lengthening)          16         11

Mean durations of /lag/
                                           UMlagern   umLAGern   difference (LAG-lag)
Utterance-final                               249        274            25
Nonfinal                                      217        241            24
Difference (due to final lengthening)          32         33
the relevant column in table 8.1b.3 This pilot does not provide enough data for us to isolate the cause for this difference between the speakers. But despite this difference, the data cause us to reject explanations of the second type. Specifically, in both /um/ and /lag/ the presence of pitch accent alters duration, over and above any concurrent implementation of final lengthening. Concerning the phonological representation of intonation in German, a closer inspection of the details of Kohler's results yields a few clues. Kohler distinguishes between three alternative accents which differ phonetically in how their F0 peaks are aligned with their associated syllables, and are correspondingly assigned three different semantic glosses. This might lead us to posit that (i) the three accents share a common tonal representation (e.g. that they are all local rise-fall configurations), but (ii) they differ on a three-way alignment feature (e.g. before, during, or after the syllable rhyme). It is possible, however, to analyze the three accents described by Kohler in a different way, such that they do not share a common tonal structure. For example, in Pierrehumbert's (1980) framework, the early peak would be quite different from the middle and late peaks in its tonal makeup. The early peak would be a H+L* - where F0 steps down onto the accented syllable from an immediately preceding higher value - hence the /lag/ would contain an F0 target at about the middle of the speaker's range. This would account for the knee-point evident between points B and C in the F0 contour in Kohler's figure 7.3b. It would predict that the only reason that the F0 contour continues to fall on the right-hand side of this knee-point is that the H+L* accent is followed by a low (L) tone. Ladd (1983) has presented some independent phonetic evidence for the existence of this H+L* accent in German. His figure 4.5 (Ladd 1983: 49) shows an F0 contour for a sentence excised from a corpus of spontaneous German speech. The contour contains a sequence of two H+L* pitch accents, in each of which F0 quickly steps down onto the accented syllable from a previous higher level, but only in the latter of the two is the accent followed by a L. The first of the two accents is therefore in precisely the context where a fall throughout the whole syllable would not be expected, because F0 would be heading out of the first H+L* towards the start (H) of the second one. The F0 contour indeed leaves the first syllable without falling, as soon as it reaches the target value (corresponding to the knee-point in Kohler's example).4 To complete the analysis of Kohler's three accents in Pierrehumbert's framework, the middle intonation peak would be a H* (or possibly, from Kohler's figure 7.1a, a L+H*), with a higher F0 target associated with the accented syllable. The late intonation peak would represent a L*+H: a movement from a low target associated with the accented syllable up to an immediately following high target. Perceptual evidence suggesting that (English-speaking) listeners treat this contrast more like a category distinction than a continuously variable feature of peak delay can be found in Pierrehumbert and Steele (to appear).
This alternative phonological description brings a number of advantages. It accounts for the early peak having a shallower fall than the middle and late peaks, thereby supplying a phonological motivation for Kohler's conclusion that the shape of a peak has perceptual relevance. In addition, it describes more precisely which parts of the shape should matter in which contexts. This alternative phonological description also explains why the psychophysical curve representing the listener's perception of the 12-item continuum in Kohler's figure 7.6 is flatter in one case (the broken line) than in the other (the continuous line). The listeners were trying to decide the word's identity in the face of conflicting, phonologically incompatible cues. The LPC parameters were from Er wird's wohl UMlagern, and consequently /um/ had all of the segmental acoustic/phonetic qualities of being stressed. Yet the F0 contour was from a H+L* that associated with /lag/. The knee-point in that syllable survived the transfer to the new utterance (it can still be seen in Kohler's figure 7.5a), and was therefore inconsistent with stress being on the /um/. In stimuli numbers 1 through to 5 of the continuum, in which Kohler shifted the peak to the left, he time-expanded the shallow fall, thereby making it even shallower and (crucially) maintaining F0 in /lag/ higher than it could possibly be if the /um/ had really borne the accent. What this all means is that the listeners were confronted with contradictory signals in a phonologically impossible stimulus continuum: a H+L* accent linked to the syllable after the one that by all other characteristics was stressed. Perhaps not surprisingly, all along the continuum listeners were somewhat unsure which aspects of the signal to believe, and consequently the judgements were never 100% for one form or the other. From a psychological perspective, this and other aspects of Kohler's results argue that there is no single perceptual cue that listeners will unanimously attend to in a contradictory stimulus. Rather, it seems that the perceptual strategy is to attend to all dimensions of the signal, and use those that are unambiguous to help resolve the ambiguous ones. Taken together, I believe the results of Kohler's first experiment argue for a treatment of the phonology and phonetics of German intonation along the lines outlined above. Furthermore, the results provide evidence that listeners have implicit knowledge5 of the phonological inventory, the structural constraints, and detailed consequences of phonetic implementation, and that they use this knowledge during speech perception. Only if the combination of F0 and the rest of the accompanying utterance are phonologically well-formed will judgments be clear and unanimous. F0 is autonomous neither in speech production nor in speech perception.
8.2 Micro F0
It has long been known that F0 varies systematically according to the vowels and consonants that make up an utterance. Kohler calls these effects "micro F0." It
has also long been known that, at least in isolated CV monosyllables, listeners can use one aspect of micro F0 - specifically the F0 transition at consonantal release - as a cue to whether the consonant is voiced or voiceless. Kohler's experiments address some controversial issues in this area. The major issue concerns whether micro F0 really matters very much. It has been argued that micro F0 effects are largely an artifact of laboratory experimental procedures in which speakers read out lists of nonsense syllables, articulating as clearly as possible (e.g. Umeda 1981). In real-world connected speech, it is claimed, micro F0 is expected to be at least largely diminished if not completely absent. As the label "micro" implies, these effects are thought to be rather small, relative to the much larger effects of sentence intonation (e.g. Ohala and Eukel 1978). They are also considered to be out of a speaker's control, and to have little or no relevance for the perception of intonation (de Pijper 1983). Even their relevance for perception of obstruent voicing has been challenged as being negligible in comparison with voice onset time (Abramson and Lisker 1985). Of course one way to investigate the importance of micro F0 is to examine its production and perception in longer utterances than isolated monosyllables, and at the same time to take into account and control the concomitant intonational structure. Those (relatively few) studies which take this approach are gradually suggesting a quite different picture: that under the appropriate prosodic circumstances, segmental perturbations are anything but irrelevant.6 Ladd and Silverman (1984) compared vowel intrinsic F0 (another major form of micro F0) in lists of carrier sentences and in paragraphs of connected speech, and found that instead of being minimal, the effects are largest in the most prosodically salient words of connected speech. In some cases they were larger than differences in F0 that would arise from intonational phenomena such as downstep applying to a pitch accent. Using multi-accent sentences, Reinholt Petersen (1979) has found that the magnitude of vowel effects is larger in accented than unaccented syllables. Similarly, Steele (1985) has found these effects on accented syllables to vary according to the prominence of the pitch accent. In order to investigate the perceptual importance of segmental influences on F0, Silverman (1985) asked listeners to judge the intended prosodic structure of meaningful utterances in which different vowels had been substituted (such as They only feast before fasting versus They only fast before feasting), and found that listeners calculate and then factor out intrinsic vowel F0 when recovering the intended intonation from the acoustic speech signal. Kohler's current results contribute to this emerging picture. Two of his experiments show that both in German and in English sentences listeners will use the F0 transition into a consonant as a cue to its voicing. Microprosodic effects before obstruents are themselves a topic of contradictory claims. Some researchers deny that they have any relationship to obstruent voicing, while others have even claimed that they do not exist at all. In a production study (Silverman 1984)
I measured F0 before and after obstruents in a corpus of utterances in which I attempted to control, rather than average out, the global intonation. A consistent pattern in the results was that during the last 60 or so ms before all obstruents, F0 was sharply lowered relative to the underlying intonation. One aspect of the results that was most relevant to Kohler's experiments (in this volume) was that the amount by which F0 was lowered depended on the voicing of the following obstruent: it was lowered by about 10 Hz before voiced obstruents, but only by about 6 Hz or less before voiceless obstruents. This difference is quite small, but pervasive (in all of the contexts measured), statistically significant, and consistent across speakers. Preconsonantal perturbations are the smallest of all the segmental influences on F0. Kohler's experiments provide evidence that listeners can make use of even these small preconsonantal F0 differences during speech perception, and thereby show yet another context, in addition to those mentioned above, in which microprosodic effects have perceptual significance. The pervasiveness and perceptual importance of segmental influences on F0 present an important question: how do listeners know whether to attribute a particular F0 level at a particular place in an utterance to the underlying intonation or to the phonetic segmental structure? This is precisely the issue raised by the results of Kohler's last experiment: the robust effect of microprosody on listeners' perception of voicing (widen versus whiten), which Kohler had replicated in a number of previous experiments, suddenly disappeared. This result surprised me. Kohler explains it by reference to the underlying prosody: in the widen/whiten sentences there is a rise-fall intonation, which means that there is an F0 peak late in the vowel - just before the t/d and consequently right where the pre-stop microprosodic influence would be. In this context, he reasons, listeners cannot separate the microprosodic effects from the macroprosody because the latter obliterates the former. If I understand the argument correctly, it assumes that listeners can only extract one type of information from F0 at a time. They can use F0 to infer either segmental information or suprasegmental information, but not both at once. When a vowel contains an intonational target, microprosodic effects will be ignored. There are three reasons why I would not have expected this. Firstly, in the production study that I mentioned above, one of the speakers (IW) produced an intonational peak (a prenuclear H* accent) at the end of a vowel in the test words, in a similar position to the F0 peak referred to by Kohler, and yet in the same place this speaker also showed a clear effect of the following obstruent's voicing on the F0 contour. Since such a regularity occurs in production, we would expect listeners to be able to capitalize on it during perception. This brings us to the second reason: it is not computationally impossible to decompose an F0 contour into several parallel contributory sources. The presence of a rise-fall pitch accent does not preclude the possibility of estimating the influence of the upcoming consonant on the pre-closure F0 contour. Listeners do "know" what that influence would be, as amply evidenced by the results of Kohler's previous experiments. Computationally this is sufficient information to subtract them out of the F0 values, and thereby to identify which influences are most likely present. For example, it is more likely that those stimuli where F0 just before the consonant fell from 123 Hz to 85 Hz (i.e. the "level + falling" pattern) contained the influence of a following /d/, whereas those stimuli in which F0 stayed high at around 122 Hz (i.e. the "level" pattern) were more likely to contain a postvocalic /t/. Since listeners heard 21 stimuli they had ample opportunity to normalize to the speaker's voice range, and so the calculation is not underspecified. The argument here is basically that if listeners can make use of something, then they will. A third reason to question the assumption that listeners cannot compute both macro- and microprosodic influences at the same location comes from experimental results in which they compute one in order to compute the other. I have already mentioned the perceptual experiment in which I asked listeners to judge the intended prosodic structure from utterances in which I synthetically varied the vowel identity. More specifically, the task was to judge which of the two accented words was meant by the speaker to be the most important. This is conveyed by the relative prominence of the two pitch accents, which means (in this case) the relative height of the F0 targets in the two vowels. This is of course part of the macroprosodic pattern. Yet the same targets also contained strong microprosodic components (the intrinsic pitch of the vowels), which listeners identified and factored out in order to calculate the underlying macroprosodic relationship. If they did it with vowel effects, then why not with consonant effects as well? In another experiment (Silverman 1986) I asked listeners to judge the voicing ("pa" versus "ba") of the last syllable in a trisyllabic word. I synthesized a number of versions of stimuli by independently manipulating the F0 transition at the release of the stop (rising, level, or falling) and the overall intonation contour throughout the word (L* H- H% vs. H* L- L%). The results showed that F0 cues to voicing depend on the underlying intonation. Listeners appear to track the intonation contour in an utterance, and on that basis predict where F0 should be if it were being driven solely by the underlying tonal structure. They compute the difference between the predicted and the actual F0 values, and in this way they factor out the macroprosodic influences in the F0 contour in order to recover the microprosodic effects. These they use to help identify segmental features (such as stop voicing). Thus not only can they compute both macro- and microprosodic components in the same section of an F0 contour, but they must compute the former in order to be able to use the latter. These are my reasons for being surprised at the results of Kohler's last experiment. I am now left with the burden of explaining them, and here I can only offer a suggestion. Perhaps the absence of any microprosodic effect is not directly due to the tonal structure (the rise-fall accent), but rather to the stress structure.
In the (1984) production study I found that the magnitude of F0 perturbations before (as well as after) a consonant depends on the amount of stress on the syllable for which that consonant is the onset. Perturbations are largest around a consonant that begins a stressed, accented syllable; they are reduced if that syllable has full vowel quality but no accent or stress; and they are smallest if that syllable is completely reduced. Kohler's stimuli represent the latter case: the stop in widen and whiten begins a reduced syllable. This is where the microprosodic influences will be least, and so perhaps our speech perception strategies do not place much weight on them in this prosodic context. However, this cannot be a complete explanation, for if it were then there would also have been no effect in the Lighton-versus-Lyden experiment.

In summary, Kohler's latter three experiments provide further evidence that segmental influences on F0 are indeed perceptually relevant, and therefore should form an important part of our models of phonetic implementation on the one hand, and speech perception on the other. At the same time they raise new questions, and along with Kohler's first experiment they remind us that listeners, as well as phonologists, are concerned with the mapping between abstract representations and the acoustic speech signal.
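The decomposition argument made above lends itself to a small numerical illustration. The sketch below is purely illustrative and is not part of the original commentary: the F0 values, the perturbation templates, and the function name are all invented for the example, under the assumption that a listener has stored rough pre-closure perturbation patterns for voiced and voiceless stops.

```python
# Illustrative sketch (not Silverman's model): recover a microprosodic voicing cue
# by factoring out the predicted macroprosodic F0 contour. All values are invented.

def classify_voicing(observed_f0, predicted_macro_f0):
    """Compare the pre-closure residual (observed minus predicted macroprosody)
    with stored perturbation templates for a following /d/ vs. /t/."""
    # Residual = what is left of the F0 contour once the intonational
    # (macroprosodic) component has been subtracted out.
    residual = [obs - pred for obs, pred in zip(observed_f0, predicted_macro_f0)]

    # Hypothetical perturbation templates over the last few frames before closure:
    # a fall of roughly 10 Hz before a voiced stop, about 6 Hz before a voiceless one.
    templates = {"d": [0.0, -4.0, -10.0], "t": [0.0, -2.0, -6.0]}

    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Choose the segmental hypothesis whose template best matches the residual.
    return min(templates, key=lambda seg: distance(residual, templates[seg]))

if __name__ == "__main__":
    # A rise-fall accent predicts 120, 118, 114 Hz just before closure;
    # the speaker actually produced 120, 114, 104 Hz (an extra fall of about 10 Hz).
    print(classify_voicing([120.0, 114.0, 104.0], [120.0, 118.0, 114.0]))  # -> "d"
```

The point of the sketch is only that the arithmetic is well defined: knowing the macroprosodic prediction and the typical perturbation shapes is enough to attribute the residual to one segmental source or the other.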
Notes

1 This view underpins much recent work, such as Liberman and Prince (1977), Ladd (1980), and Pierrehumbert (1980), although it goes back at least to Bolinger (1972). For a discussion of the relationship between stress and accent across different language types, see Beckman (1986).

2 I measured up to the onset of maximum closure for /g/ in /lag/, and thus the duration of the closure is not included in the data. This seemed prudent because of the ambisyllabic nature of the intervocalic stop.

3 The direction of this difference seemed quite counterintuitive, and so I remeasured all of the utterances using different criteria for the onset of the /u/. All of the criteria - onset of glottalized voicing, onset of regular periodicity, and the point at which the second formant reached its target - yielded the same pattern, and the same difference between the two speakers.

4 Ladd's phonological treatment of the accent differs in some ways from H + L*, but is not inconsistent with the interpolation to the next accent that is discussed here. He, too, classes the two accents as being the same, even though F0 only falls to the bottom of the speakers' range in the second of them.

5 This is the sense described by Cutler (this volume) as weak psychological reality.

6 The reader is referred to Silverman (1987) for a review of the importance of segmental influences on F0 in the production and perception of speech.

References

Abramson, A. S. and L. Lisker. 1985. Relative power of cues: F0 shift versus voice timing. In V. Fromkin (ed.) Phonetic Linguistics. New York: Academic Press.
Beckman, M. E. 1986. Stress and Non-Stress Accent. Dordrecht: Foris.
Bolinger, D. L. (ed.) 1970. Intonation: Selected Readings. Harmondsworth: Penguin.
Cutler, A. (this volume) From performance to phonology.
Kohler, K. J. (this volume) Macro and micro F0 in the synthesis of intonation.
Ladd, D. R. 1980. The Structure of Intonational Meaning: Evidence from English. Bloomington: Indiana University Press.
1983. Peak features and overall slope. In A. Cutler and D. R. Ladd (eds.) Prosody: Models and Measurements. Berlin: Springer-Verlag.
Ladd, D. R. and K. E. A. Silverman. 1984. Intrinsic pitch of vowels in connected speech. Phonetica 41: 31-40.
Liberman, M. Y. and A. Prince. 1977. On stress and linguistic rhythm. Linguistic Inquiry 8: 249-336.
Ohala, J. and B. Eukel. 1978. Explaining the intrinsic pitch of vowels. Report of the Phonology Laboratory, University of California, Berkeley 2: 118-125.
Pierrehumbert, J. B. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT.
Pierrehumbert, J. B. and S. A. Steele. To appear. Tonal alignment and category in English intonation. Phonetica.
de Pijper, J. R. 1983. Modelling British English Intonation. Dordrecht: Foris.
Reinholt Petersen, N. 1979. Variation in inherent F0 level differences between vowels as a function of position in the utterance and in the stress group. Annual Report of the Institute of Phonetics, University of Copenhagen 13: 27-59.
Scherer, K. R., D. R. Ladd, and K. E. A. Silverman. 1984. Vocal cues to speaker affect: testing two models. Journal of the Acoustical Society of America 76: 1346-1356.
Silverman, K. E. A. 1984. F0 perturbations as a function of voicing of pre-vocalic and post-vocalic stops and fricatives, and of syllable stress. In E. Lawrence (ed.) Proceedings of the Autumn Conference of the Institute of Acoustics 6: 445-452.
1985. Perception of intonation depends on vowel intrinsic pitch. Journal of the Acoustical Society of America Supplement 1, 79: S38.
1986. Segmental perturbations depend on intonation: the case of the rise after voiced stops. Phonetica (Special issue on "Prosodic Cues to Segments") 43: 76-91.
1987. The structure and processing of fundamental frequency contours. Ph.D. dissertation, Cambridge University.
Steele, S. A. 1985. Vowel intrinsic fundamental frequency in prosodic context. Ph.D. dissertation, University of Texas at Dallas.
Umeda, N. 1981. Influence of segmental factors on fundamental frequency in fluent speech. Journal of the Acoustical Society of America 70: 350-355.
9 Lengthenings and shortenings and the nature of prosodic constituency MARY E. BECKMAN AND JAN EDWARDS
9.1 Introduction
There are two durational effects often cited as evidence for very different models of prosodic constituency in English. The first is known variously as "final lengthening," "pre-boundary lengthening," or "pre-pausal lengthening" (e.g. Oller 1973; Klatt 1975; Cooper and Paccia-Cooper 1980). As the last name suggests, this effect is usually interpreted as a durational correlate of the sort of disjuncture that can cause a momentary cessation of speech. The second durational effect has no similar unified set of labels, but it might be called "stress-timed shortening" since it belongs to a class of effects that have been interpreted as indications of a tendency toward isochronous spacing of prosodically strong syllables; a stressed syllable in a polysyllabic word or stress foot is compressed in order to make the overall duration of its word or stress foot closer to that of a contrasting monosyllable (e.g. Huggins 1975; Fowler 1977).

The two effects are similar in that both involve an apparent adjustment of syllable durations which is dependent on some notion of constituency. In the first case, a syllable is lengthened because of its position near to the edge of some constituent, and in the second it is shortened because of the length of a constituent defined by adjacent peaks at some prosodic level. The two effects differ radically, however, in the type of constituency that is implicit in their interpretations. Final lengthening implies a constituent that has well-defined edges; the lengthening occurs before a boundary that could be followed by a pause. But this constituent need not have a phonological head; its internal prosodic structure could be completely flat. Stress-timed shortening, by contrast, implies a constituent headed by a phonological prominence at some prosodic level; the shortening occurs to equalize durations for some unit, such as the stress foot, which is defined by the necessary occurrence of some sort of prominence peak. But it says nothing necessarily about edges; the unit need not be a constituent at all in the sense of having clearly identifiable boundaries. Thus, in theory, final lengthening and stress-timed shortening are radically different durational effects, requiring different
sorts of surface representations for their different sorts of constituents, as suggested in (1): (1)
a. final lengthening on elements marked with "↑":

   [x x] [x x] [x] [x x]
      ↑     ↑   ↑     ↑

b. stress-timed shortening on elements marked with "↑":

   (metrical trees or grids in which "↑" marks the stressed syllable heading each foot)
In practice, however, the effects are not so easy to differentiate. Is a particular stressed syllable which is initial to a polysyllabic word shorter than a matched monosyllable because it is compensating for the syllables following it in the stress foot that it heads? Or is it shorter because it has not undergone final lengthening? Conversely, is a particular constituent-final syllable longer than a matching nonfinal syllable because of a local deceleration at the constituent boundary? Or is it longer because it has not been shortened by syllables intervening before the next constituent's head? It is difficult to design experiments in which one interpretation can be distinguished from the other, and in many experiments reported in the literature, the two effects are in fact confounded.

Despite this difficulty, the two effects must be separated, because the different models of constituency implied by them have played a central role in the development of metrical theory. For example, the major differences between Selkirk's earlier tree-based account of stress patterns (Selkirk 1980, 1981) and her later grid-only account (Selkirk 1984) can be characterized largely in terms of the different notions of prosodic constituency that are assumed. The tree-based account assumes a stress foot which has clear edges and which constitutes a lower-level constituent in the same phrasal tree as the intonational phrase, whereas the grid-only account represents stress by an independent hierarchy of rhythmic beats, which has only a very indirect relationship to intonational constituency via the addition of "silent beats" (interpreted as pause or lengthening) at syntactic and other phrasal boundaries, including intonational phrase breaks. The nature of prosodic constituency is thus a crucial question in determining the possible representations of stress and of its relationships to phrasing and intonation. And the experimental basis for an answer to this question would depend on a better understanding of pre-boundary lengthening and stress-timed shortening.

In this paper, then, we will explore these two interpretations as they relate to metrical representations of prosodic structure in English. We will describe a series of experiments in which we have attempted to distinguish these effects and to locate them at some level of the prosodic hierarchy. We feel that a large part of
the difficulty with earlier experiments stems from a mistaken assumption that questions about the phonetic characterization of stress and its relationship to phrasing and intonation can be addressed pre-theoretically. The unexamined phonological theory that is only implicit in the experimental design then confounds the different possible interpretations. Before proceeding to the description of the experiments, therefore, we will first lay out our assumptions about surface phonological representation and the relationship between that representation and its phonetic interpretation.
9.2 Phonology and syntax
A first assumption that we make is that the effects under consideration do have something to do with prosody and are not direct reflections of syntactic structure. In the case of stress-timed shortening, this assumption is built into the description of the effect. Stresses are prosodic events after all. In the case of pre-boundary lengthening, however, the assumption is less obvious. There is nothing in the effect as described thus far that necessitates that the boundary be the edge of a prosodic unit, and, in fact, the effect has often been discussed in terms of syntactic constituency alone. Thus, the title of Klatt 1975 states that "Vowel lengthening is syntactically determined in a connected discourse." Similarly, Cooper and Paccia-Cooper (1980) include final lengthening in their catalogue of effects that reflect syntactic boundary strength. Our principal concern, by contrast, is to see whether final lengthening can provide evidence for any sort of hierarchy of phonological boundary strength. Hence, we begin with a bias against attributing any part of the effect to an influence from syntax on phonetics that proceeds directly, without the mediation of phonological structure. This assumption is a bias, however, and not an axiom. We do not reject a priori the possibility that certain phonetic effects might be direct analogue encodings of syntactic or other nonphonological structure. For example, we think that some aspects of phrasal prominence relationships should not be assigned phonological status, but should instead be relegated to the phonetics of pitch scaling (see section 3). Similarly, work by Hirschberg and Pierrehumbert (1986) suggests that final lowering has no phonological status in the organization of extended utterances. They report a preliminary experiment in which they created the synthesized responses of a computer in an interactive tutorial program. An analysis of the computer's introductory message showed that appropriate amounts of final lowering can be predicted from the discourse topic structure, making it unlikely that this effect is a mark of a prosodic constituent above the intonational phrase. We do not rule out the possibility that aspects of pre-boundary lengthening likewise can encode focus and other informational organization directly. We merely question the wisdom of not looking first at possible mediation by an independently motivated phonological organization.
Furthermore, it is probably safe to rule out any account of pre-boundary lengthening that attributes all of the effect to syntactic phrasal structure. If Gee and Grosjean's (1983) discussion of the related phenomenon of pause length patterns is any guide, such seemingly extra-grammatical factors as speech rate and constituent size at least do affect temporal boundary cues. If these two factors exercise their effect on pre-boundary lengthening by influencing grammatical structure, it seems safe to assume that it is phonological structure that they influence and not syntactic structure. Or, if rate can affect phonetic cues for constituent boundaries directly without affecting grammatical structure, as suggested by Selkirk (1984), then constituent size at least must exercise its influence with some reference to phonological constituency. Moreover, since the intonational system gives us independent evidence for some phonological constituents, we must at least consider the possibility that the durational effects also reflect these constituents. And again, the practical difficulty of separating preboundary lengthening from stress-timed shortening guarantees that, if we ignore prosodic structure in designing an experiment, we vitiate the explanatory power of its results. The next section discusses the types of phonological constituent motivated by intonation and describes our understanding of how these constituents relate to the prosodic hierarchy of stresses.
9.3 Phonology and phonetics
We consider the prosodic structure of an utterance to be a hierarchical arrangement of various prominence-lending phonological properties. This arrangement can be represented by a metrical grid with suitable bracketings at any level that also has constituents with phonologically marked edges. The grid in (2), for example, represents the phrase phonological structure as it might be said in isolation, with an intonation typical of citation forms. As the example makes clear, we adopt the intonational analysis and notation of Pierrehumbert 1980 as modified and extended in Anderson et al. (1984), Liberman & Pierrehumbert (1984), and other later work. We emphasize further that our concern is entirely with the surface properties depicted in such representations; we have nothing to say about derivation or about underlying prosodic patterns abstracted away from the surface properties that realize them in an actual utterance with a particular intonation contour. (2)
    [                       x           ]    nuclear accent, boundary tone
              x             x                accent
      x       x             x                stress
      x   x   x   x    x    x     x          syllable
      phonological          structure
              |             |
              H*            H*  L  L%
Note that we choose to represent the hierarchical arrangement as in (2) not because we think a grid will prove to be a better model than a tree would be, but rather because we know little about surface phonological groupings below the intonational phrase, and we do not want to commit ourselves prematurely to a representation that shows constituent boundaries about which we are unsure. With the grid, by contrast, we can show only the prosodic heads, about which we have more information.

The prominence peaks at each of the levels of the grid in (2) have well-documented phonological markings. The lowest level of this grid consists of seven local sonority peaks defining events called "syllables." Three of these syllables contain unreduced vowels, and are quantitatively longer and louder than the others, properties which define another level of events called "stresses." Two of these stressed syllables are autosegmentally associated to certain prominence-lending tonal configurations in the intonation contour, the two H* "pitch accents"; the association to a pitch accent creates another level of prosodic strength, that of "accented syllables." The last pitch accent is followed by a L tone that is not associated with any particular syllable. This is the "phrase accent". The falling tonal pattern created by the juxtaposition of the phrase accent helps to give the syllable associated to the last pitch accent a special prominence known as "nuclear stress" or "sentence stress." (The last accent itself is designated the "nuclear accent.") There is also a L% "boundary tone" aligned to the edge of the phrase after the phrase accent. This boundary tone phonologically marks the end of a constituent called the "intonational phrase."

The grid shown in (2) constitutes a particular hypothesis about prosodic organization in English; it amounts to a claim that these four levels and only these four levels are to be represented in the surface phonology. This particular hypothesis could be wrong on two points. First, it is possible that the property of being qualitatively longer and louder must be differentiated from the property of having an unreduced vowel as the mark of a separate distinct level of stress (see Beckman 1986 for a review of the relevant experimental literature). Second, it is possible that the phrase accent delimits a separate intonational constituent that is smaller than the intonational phrase (see Beckman and Pierrehumbert 1986).1 We choose to show just the four grid levels in (2), however, because they are characterized by clearly identifiable phonological properties that are already well documented in the phonological and experimental literature.

The particular hypothesis represented in (2) thus also exemplifies a criterion for choosing from a general class of models that fit our notion of what constitutes an adequate phonological representation of intonation and other prosodic structure. The rationale for this criterion stems from the fact that phonological properties are categorical and qualitative. They single out discrete phonetic properties or they abstract away from continuously variable phonetic properties so as to create discrete differences that can mark organizational structure and paradigmatic contrast. This general characterization of phonological properties is inherent in
conventional systems of phonological representation, which provide discrete segmental symbols even for nonsegmental or syntagmatic properties (e.g. nodes and branches in a tree or rows and columns of x's in a grid). Our criterion for choosing grid levels is in keeping with such conventional modes of phonological representation; we choose to include as levels of the prosodic hierarchy only those aspects of prosodic prominence that can be defined in terms of qualitative phonological properties. These qualitative properties impart an absolute prominence to each beat. For example, an accented syllable is absolutely more prominent than any stressed syllable not associated to a pitch accent. These sorts of absolute categorical differences we represent in the grid. Of course, not all prominence relationships are absolute. For example, one nonnuclear accented syllable can be relatively more prominent than another by virtue of the relative pitches of the associated pitch accents within a given pitch range. However, we do not represent such merely quantitative distinctions by adding layers of beats to the grid, since to do so would be to accord phonological status as a categorical property to a continuous iconic symbolism that is better represented in the phonetic component, where phonological properties are interpreted as motoric or physical patterns for producing and perceiving speech.

A second noteworthy aspect of the grid in (2) is that two of the four layers are defined in terms of intonational properties. This reflects the important role played by the intonation contour in organizing an English utterance, an importance earlier recognized in Selkirk (1984). We must, however, differentiate our notion of the role of the pitch accent as a prominence-lending property at the level just below the sentence stress from Selkirk's notion that pitch accents add beats to the grid. Selkirk's grid is a rhythmic structure, with a vaguely specified phonetic interpretation in timing patterns. At some level of the grid there are added beats initially triggered by pitch accents, but the phonetic interpretation of these beats is apparently independent of the interpretation of the pitch accents; hence Selkirk's claim that "intonation comes first," before the grid. In our grid, by contrast, the beat is not triggered by the accent but rather represents the prominence of the accent; the accented syllable is made prominent by the interpretation of the associated accent in the fundamental frequency pattern of the utterance whether or not the beat has any consequence for the temporal spacing of accented syllables. For our grid, the claim that "intonation comes first" makes little sense, like the chicken-or-egg paradox.

Returning then to the problem of stressed-syllable shortening versus pre-boundary lengthening, we can characterize the two types of prosodic constituency implicit in these two interpretations as two different hypotheses about the interpretation of the prosodic hierarchy in timing patterns. Stressed-syllable shortening implies a timing unit defined by the intervals between beats at some level of the hierarchy. If this unit has real phonological status (i.e. if the effect were
not better relegated to phonetic interpretation along with other details of phonetic timing such as the different durations of vowels before voiced and voiceless obstruents), then it should be represented in the grid by adjusting the spacing of columns so as to regularize the distances between columns having beats at this level. Note that this representation is inherently neutral concerning the affiliation of material in intervening columns; material aligned to an intervening beat at the next lower level could belong to the same constituent as the earlier beat at the higher level or it could belong with the material aligned to the following higher beat. Pre-boundary lengthening, on the other hand, implies a constituent that has clear edges delimiting affiliated from nonaffiliated material. If the constituent belongs to the prosodic hierarchy, it will be represented in the grid by bracketing, just as the edges of the intonational phrase are represented in (2). Pre-boundary lengthening would then be the phonological boundary mark (or one of several boundary marks) for that constituent. Of course, putting the brackets in the grid makes sense only if the definition of the constituent refers directly to the phonological properties defining some level in the grid - for example, if each successive constituent is headed by a prosodic peak at some grid level, as each intonational phrase is headed by a sentence stress in (2). If the constituent makes no reference to the grid, that fact might indicate that it belongs to some other phonological structure that is organized on principles independent of those building the hierarchy of prosodic prominences.

In the rest of this paper, we will describe a series of experiments that we have done in an attempt to differentiate between the two effects and among their various implications for the representation of prosodic structure in the grid. We begin our investigation by noting that only the highest level of the grid in (2) corresponds to any independently documented constituent. Here there is a boundary tone to mark the edges of units headed by the sentence stresses, whereas every other level only has the phonological event marking the prominence peak. An attractive hypothesis, therefore, is that phrase-final lengthening can be differentiated from stress-timed shortening by a dependence on the intonational phrasing of an utterance. That is, we hypothesize that final lengthening is the durational correlate of the boundary tone, and affects the syllables preceding every intonational phrase boundary. Conversely, true final lengthening must be limited to these syllables; any effect that occurs medially to an intonational phrase must be something else, perhaps a stress-timed shortening that adjusts the spacing between beats at some lower level of the grid.
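For readers who prefer an explicit data structure, the four-level grid in (2) and its single bracketed constituent can be rendered roughly as in the sketch below. This is a purely illustrative rendering, not a formalism proposed in the paper; the class and field names are invented.

```python
# Illustrative encoding (not the authors' formalism) of the surface prosodic grid in (2):
# each syllable records which of the four levels it reaches, and bracketing is kept only
# at the level whose edges are phonologically marked (the intonational phrase, whose
# right edge carries the L% boundary tone).

from dataclasses import dataclass

@dataclass
class Syllable:
    text: str
    stressed: bool = False   # unreduced vowel, quantitatively longer and louder
    accent: str = ""         # e.g. "H*" when associated to a pitch accent
    nuclear: bool = False    # bears the last (nuclear) accent

@dataclass
class IntonationalPhrase:
    syllables: list
    phrase_accent: str = "L"    # tone after the nuclear accent
    boundary_tone: str = "L%"   # marks the right edge of the phrase

# "phonological structure" said with the citation-form contour of (2)
phrase = IntonationalPhrase(syllables=[
    Syllable("pho", stressed=True),
    Syllable("no"),
    Syllable("lo", stressed=True, accent="H*"),
    Syllable("gi"),
    Syllable("cal"),
    Syllable("struc", stressed=True, accent="H*", nuclear=True),
    Syllable("ture"),
])

# The grid levels fall out of the qualitative properties of each syllable:
levels = {
    "syllable": [s.text for s in phrase.syllables],
    "stress":   [s.text for s in phrase.syllables if s.stressed],
    "accent":   [s.text for s in phrase.syllables if s.accent],
    "nuclear":  [s.text for s in phrase.syllables if s.nuclear],
}
print(levels)
```

On this rendering, stress-timed shortening would be a rule about the spacing of beats at one of the listed levels, whereas pre-boundary lengthening would be a rule keyed to the right edge of a bracketed constituent such as the intonational phrase.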
9.4 Experiment 1 - intonational phrasing and final lengthening
We tested this hypothesis with the set of sentences shown in table 9.1. These sentences contrasted two phrasings in which the end of the target word either did or did not coincide with an obligatory intonational phrase boundary. The target
word was either pop or poppa, with an identical initial stressed target syllable [pa] which either is or is not word-final. Different verbs followed the target words in order to maintain a constant inter-stress interval.

Table 9.1 Corpus for experiment 1 - intonational phrasing
a. obligatory intonational phrase break
   1. Pop, opposing the question strongly, refused an answer to it.
   2. Poppa, posing the question strongly, demanded an answer to it.
b. intonational phrase break unlikely
   1. Pop opposed the question strongly, and so refused to answer it.
   2. Poppa posed the question strongly, and then refused to answer it.

Five native speakers of American English read five tokens of each sentence from a randomized list at each of three self-selected speaking rates. We measured the durations of the segments in the first three syllables, using a digital waveform editor and standard measurement criteria to define the onset and offset of the vowels.

All five subjects showed a large and highly significant final lengthening effect at the intonational phrase boundary. This result is illustrated in figure 9.1a, which plots the duration of the [a] of the first syllable against the duration of the [e] of the "question" for tokens of the sentences in table 9.1a produced by subject KAJ. (The duration of the [e] is used as an indicator of overall speaking rate.) The open circles in the plot are for tokens with Pop, opposing..., where the [a] is in a phrase-final syllable, and the filled circles are for the tokens with Poppa, posing..., where another syllable precedes before the intonational phrase boundary. The solid and dashed lines are regression curves fitted to the two sets of data points. For every speaking rate (i.e. for every value of [e] along the x-axis) the data points for pop lie above the ones for poppa and there is a clear separation between the regression curves. Figure 9.1b shows the analogous effects for the [ə] in the second syllable of the target phrase. Again the phrase-final [ə] in Poppa, posing... is consistently longer than the non-final [ə] in Pop, opposing... Results for the other four subjects were similar, as illustrated in figures 9.2 and 9.3, which give the mean durations of [a] and [ə] in these sentences averaged over all five tokens at each of the three speaking rates for each of the five subjects.

These results support the first clause of our initial hypothesis, showing that final lengthening occurs at intonational phrase boundaries. However, the results for the other sentences contradict the second clause of the hypothesis, that all apparent final lengthenings be limited to intonational phrase boundaries. The subjects had a longer [a] in pop and a longer [ə] in poppa even in the sentences in table 9.1b, where the target words were not followed by an intonational phrase boundary. These differences were much smaller, but they were still significant for many of the subjects at many rates. This is illustrated in figures 9.4a and 9.4b, which plot the duration of the [a] and [ə] against the duration of
Figure 9.1 Target vowel duration plotted against reference vowel duration in all tokens produced by speaker KAJ of experiment 1 sentences containing an obligatory phrase break. The target vowels in the plots are the [a] of the first syllable (top) and the [ə] of the second syllable (bottom), and the reference vowel is the [e] of question.
Figure 9.2 Mean durations of the vowel [a] in the first syllable of the target sequences in the experiment 1 sentences containing an obligatory phrase break after pop or poppa. Means are averaged over 5 tokens at each rate for each speaker. Error bars show plus-or-minus one standard deviation; points with no apparent error bars have standard deviations smaller than the size of the plotting character.
the [e] for these sentences produced by subject KAJ. The data points overlap considerably more than in figure 9.1, especially at the fast and normal rates, but the two regression curves are nevertheless clearly separated along most of their lengths. Similar small differences are evident in figure 9.5, which summarizes the mean durations of the [a] and [ə] in these sentences for each of the five subjects. Note that the differences between the means for the pop and poppa are in the same direction for all speakers, although they are not evident at all rates. For example, there is no significant difference except at the slow rate for subjects JRE and LAW. This result suggests that there is a smaller effect in these sentences which is different from the substantial phrase-final lengthening at the intonational phrase boundary. Furthermore, the effect cannot be interpreted in terms of a tendency to
Figure 9.3 Mean durations of the vowel [ə] in the second syllable of the target sequences in the experiment 1 sentences containing an obligatory phrase break after pop or poppa. Sample sizes and error bars for each point as in figure 9.2.
make isochronous inter-stress intervals, because the corpus was designed expressly to always have exactly one unstressed syllable between the stress on the target syllable in the noun and the stress in the following verb:

(3)
    x        x          x              x        x         x
    x    x   x      x   x    x         x   x    x     x   x    x
    Pop  opposed  the   question.      Poppa    posed the question.
Thus, the smaller difference in the sentences where there was no medial intonational phrase break must be indicative of something other than stressed-syllable shortening; it must be governed by the occurrence of some sort of constituent edge after the target noun. We hypothesized that it might be some sort of final lengthening for a smaller prosodic unit below the level of the intonational phrase. We labeled the effect "word-final" (as opposed to "phrase-final")
Figure 9.4 Target vowel duration plotted against reference vowel duration in experiment 1 sentences in which an obligatory phrase break is unlikely following the pop or poppa. Target and reference vowels and speaker as in figure 9.1.
Figure 9.5 Mean durations of the vowels in the target syllables in tokens of the experiment 1 sentences with no obligatory phrase break after pop or poppa. Means averaged over 5 tokens at each rate for each speaker.
lengthening and did two further experiments in order to locate it more precisely within the phonological framework outlined above.
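The plots in figures 9.1 and 9.4 compare, within each phrasing condition, a regression of target-vowel duration on the reference-vowel duration used as the rate indicator. The sketch below illustrates that style of analysis with made-up numbers; it is an editorial illustration of the method, not the authors' analysis code, and the token values are invented.

```python
# Rough illustration (made-up data, not the experiment's measurements): fit target-vowel
# duration against reference-vowel duration separately for the "pop" and "poppa"
# conditions and compare the fitted lines, as in figures 9.1 and 9.4.

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y ~ a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical (reference [e] duration, target [a] duration) pairs in ms
pop_tokens =   [(100, 215), (120, 240), (140, 262), (160, 290), (175, 305)]
poppa_tokens = [(100, 150), (120, 168), (140, 185), (160, 205), (175, 220)]

for label, tokens in [("pop", pop_tokens), ("poppa", poppa_tokens)]:
    xs, ys = zip(*tokens)
    a, b = fit_line(xs, ys)
    print(f"{label:6s} duration of [a] ~ {a:.1f} + {b:.2f} * rate indicator")

# A consistent gap between the two fitted lines at every rate is the lengthening effect:
# the final [a] of "pop" stays longer than the non-final [a] of "poppa" however fast
# or slow the token is spoken.
```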
9.5 Experiment 2 - word-final lengthening and the accentual phrase

The first possibility we considered is that word-final lengthening is a boundary mark at the level of the pitch accent. That is, we posited the existence of "accentual phrases," phonological constituents which are headed by accented syllables and bounded by word-final lengthening, as shown in (4):

(4)
    [              x                    ]    nuclear accent / intonational phrase
    [       x     ] [       x          ]    accent / accentual phrase
      x     x               x                stress
      x  x  x  x  x         x    x           syllable
      phonological          structure
            |               |
            H*              H*  L  L%
Lengthenings and shortenings and the nature of prosodic constituency
If the domain of word-final lengthening is a prosodic constituent, this seemed a plausible unit to propose, because accent patterns belong to the intonation and thus are part of the post-lexical phrasal phonology, whereas stress patterns are largely specified in the lexicon. Also, speakers may produce more prenuclear pitch accents in slower renditions of a given sentence, a tendency which could explain the apparent dependency on rate in subjects such as JRE and LAW.

Table 9.2 Corpus for experiment 2 - accentual phrasing
a. postnuclear
   1. Q. Did his dad pose a problem as far as their getting married?
      A. HER poppa posed a problem.
   2. Q. Was his dad opposed to their getting married?
      A. HER pop opposed the marriage.
b. prenuclear unaccented
   1. Q. Was his dad involved in solving their problems?
      A. HER poppa POSED a problem.
   2. Q. Did his dad feel strongly about their marriage?
      A. HER pop OPPOSED the marriage.
c. nuclear
   1. Q. Was her mama a problem about the wedding?
      A. Her POPPA posed a problem.
   2. Q. Was her mom against their getting married?
      A. Her POP opposed the marriage.
Table 9.2 lists the corpus of sentences that we designed to test for the accentual phrase. Before describing the predictions involved in this corpus, we first note that not every word in an intonation contour will necessarily have an accent. If accentual phrases are phonological constituents, then the status of words with no accents must be addressed. It is possible that such a word belongs to the same accentual phrase as the preceding accented word, as shown in (5a). Or it could be that they are "extra-metrical" to the prosodic units at this level, and are unaffiliated to any accentual phrase, as in (5b): (5)
a. [ [HER pop] [OPPOSED the marriage] ]
      |          |
      H*         H*      L   L%

b. [ [HER] pop [OPPOSED] the marriage ]
      |          |
      H*         H*      L   L%
We designed experiment 2 in order to test the hypothesis that the domain of word-final lengthening is the accentual phrase and also to determine the status of words with no accents. The same five subjects and one more subject read ten tokens of each sentence at each of three rates. The sentences had the same target contrast
between pop opposed and poppa posed that was used in experiment 1. These target phrases were placed in contexts to produce a three-way contrast in accent placement: (1) post-nuclear position, in which the target word immediately followed the syllable with nuclear accent; (2) prenuclear position, in which the target word was unaccented and immediately preceded the word with nuclear accent; and (3) nuclear position, in which the target word received nuclear accent. The three accent placements were chosen for the following two reasons. First, postnuclear position tested the plausibility of the unit in general. If the accentual phrase is the domain of the phrase-medial effect in the first experiment, there should be no difference in the duration of the [a] in the contrasting target phrases whatever the status of unaccented material, since there will never be an accentual phrase break after the target noun: (6)
a. Everything in some accentual phrase:

   [ [HER pop opposed the marriage] L% ]      [ [HER poppa posed a problem] L% ]

b. Unaccented words extra-metrical:

   [ [HER] pop opposed the marriage L% ]      [ [HER] poppa posed a problem L% ]
Second, if the prediction in (6) is borne out, then the other two accent placements should resolve the question of how to treat unaccented material.2 Specifically, in nuclear position, there should be word-final lengthening if and only if unaccented material is extra-metrical to the accentual phrase, as shown in (7b) as contrasted with (7a) : (7)
a. [ [Her POP opposed the marriage] ]    vs.   [ [Her POPPA posed a problem] ]
          |                                            |
          H*                 L   L%                    H*                L   L%

b. [ Her [POP] opposed the marriage ]    vs.   [ Her [POPPA] posed a problem ]
          |                                            |
          H*                 L   L%                    H*                L   L%
In prenuclear position, on the other hand, there should be word-final lengthening if and only if unaccented words are not extra-metrical, as shown in (8a) as contrasted to (8b): (8)
a. [ [HER pop] [OPPOSED the marriage] ]   vs.   [ [HER poppa] [POSED a problem] ]
      |          |                                  |            |
      H*         H*        L   L%                   H*           H*       L   L%

b. [ [HER] pop [OPPOSED] the marriage ]   vs.   [ [HER] poppa [POSED] a problem ]
      |          |                                  |            |
      H*         H*        L   L%                   H*           H*       L   L%
The crucial first prediction represented in (6) was not borne out. All of the subjects showed a longer [a] in pop and a longer [ə] in poppa in postnuclear position. These differences were not significant for all subjects at all rates, but the effect was consistently in the same direction, as illustrated in figure 9.6, which shows the mean durations of the target vowels in postnuclear position for all five subjects. These results indicate that word-final lengthening can occur in the absence of any possible accentual phrase boundary. Moreover, there was also an apparent word-final lengthening in both of the other accentual positions. This result can be observed in figures 9.7 and 9.8, which show the mean durations in prenuclear and nuclear positions, respectively. In both figures, the means for the [a] in pop are longer than those for poppa, and the means for the [ə] in poppa are generally longer than those for pop opposed. From the mutually exclusive predictions depicted in (7) and (8), however, only one of these positions should have allowed word-final lengthening. Thus, the results of experiment 2 show that word-final lengthening cannot be the boundary mark for an accentual phrase as defined in (4).
Figure 9.6 Mean durations of the vowels in the target syllables in the experiment 2 sentences with pop or poppa in postnuclear position. Means averaged over 10 tokens at each rate for each speaker.
We can think of two possible explanations for these results that are consistent with the hypothesis that the word-final lengthening effect marks some sort of phonological constituent. First, its domain could be a prosodic constituent below the level of pitch accents, perhaps a "stress foot" like that proposed in Selkirk (1981). Second, its domain could be a constituent that is independent of the prosodic hierarchy — for example, some sort of "phonological word" containing more than one stressed syllable, which could have no accents or as many accents as stresses.
9.6 Experiment 3 - stress foot or independent phonological word?

We designed a third experiment to distinguish better among the various levels of the grid below the intonational phrase, using the target phrases superstition, super
Figure 9.7 Mean durations of the vowels in the target syllables in the experiment 2 sentences with pop or poppa in prenuclear position. Means averaged over 10 tokens at each rate for each speaker.
station, and Sioux perspective. These phrases had different word boundary placements, but identical stress foot structure: (9)
   [ superstition ]        [ super ] [ station ]        [ Sioux ] [ perspective ]     words
      x       x               x          x                 x          x               stresses
The phrases were placed in contexts to produce a three-way contrast in accent placement pattern, as shown in Table 9.3. In the first pattern, the sentence stress is on make, so that there can be no accents on either stressed syllable in the target phrase because it is in postnuclear position. The second pattern placed "scooped" L*+H accents on the word real preceding the target phrase and on the second stressed syllable in the target phrase, but no accents on the first syllable, so that the first syllable in the target phrase is prenuclear but unaccented. (This is the uncertainty contour described by Ward and Hirschberg 1985.) The third pattern placed a prenuclear L* accent on the first stress and a nuclear H* on the second
Figure 9.8 Mean durations of the vowels in the target syllables in the experiment 2 sentences with pop or poppa in nuclear position. Means averaged over 10 tokens at each rate for each speaker.
stress in the target phrase. (This is the surprise-redundancy contour described by Sag and Liberman 1975.) These target phrases and contrasting accent placements were designed to differentiate among three hypotheses about word-final lengthening. The first is again the notion that the lengthening marks accentual phrases. The test for this hypothesis is that, since a lexical item can have more than one accent, it should be possible to have word-final lengthening internal to lexical items. Thus, superstition should pattern like super station; its [u] should be shorter and its [ɚ] longer than in Sioux perspective, but any difference among the three phrases should hold only when both stressed syllables are accented, in the surprise-redundancy contour: (10)
      x        x            x        x            x        x
    super   station       super   stition       Sioux   perspective
      |        |            |        |            |        |
      L*       H*           L*       H*           L*       H*
Table 9.3 Intonation patterns for experiment 3

1. postnuclear
   You may call it a superstition, but that doesn't MAKE it a superstition.
                                                       |
                                                       H*   L   L%

2. uncertainty contour
   Q. Do you have any feigned beliefs?
   A. I have a real superstition.
                 |          |
                 L*+H       L*+H   L   H%

3. surprise-redundancy contour
   Don't you understand?! It's a superstition!
                                    |      |
                                    L*     H*   L   L%
The second hypothesis is that word-final lengthening marks a constituent at the level of the lexical stresses just below the accents. If this hypothesis is correct, then there should be the durational patterns just described (the vowels of superstition patterning exactly like those of super station and so on), but without the dependency on accent placement: (11)
   [ x   x ] [ x   x ]      [ x   x ] [ x   x ]      [ x     x  ] [ x   x  ]     stress feet
     super     station        super     stition        Sioux  per   spective
The third possibility is that phrasing below the intonational phrase level is independent of the prosodic hierarchy, and that the word-final lengthening marks a "phonological word" that is not necessarily headed by a single prosodic peak such as an accent or a stress. In this case, final lengthening should occur only at the edges of actual lexical items, so that the [ɚ] in superstition should always be shorter than that in super station: (12)
   [ super ] [ station ]      [ superstition ]      [ Sioux ] [ perspective ]     phonological words
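In the same spirit as the earlier sketch, the three hypotheses in (10)-(12) make different predictions about where a lengthening-inducing boundary falls inside the three target phrases. The summary below is an editorial paraphrase with the bracketings taken from the text; it does not model the additional accent-placement requirement that hypothesis (10) imposes (lengthening only when both stressed syllables are accented).

```python
# Paraphrase of the predictions in (10)-(12): where does the relevant constituent edge
# fall in each target phrase, and so which vowel ([u] in the first syllable, [ɚ] in
# the second) is predicted to lengthen? Note that the accentual-phrase hypothesis (10)
# additionally requires the second stressed syllable to bear a pitch accent.

domains = {
    # (10)/(11): accentual phrase or stress foot - the break falls after the [ɚ]
    # syllable in all three phrases (after "super" / "Sioux per").
    "accentual phrase or stress foot": {
        "superstition":      ["super", "stition"],
        "super station":     ["super", "station"],
        "Sioux perspective": ["Sioux per", "spective"],
    },
    # (12): phonological word - the break follows actual lexical items only.
    "phonological word": {
        "superstition":      ["superstition"],
        "super station":     ["super", "station"],
        "Sioux perspective": ["Sioux", "perspective"],
    },
}

for hypothesis, parses in domains.items():
    print(hypothesis)
    for target, chunks in parses.items():
        # The [u] lengthens if a chunk ends right after it ("Sioux" alone);
        # the [ɚ] lengthens if a chunk ends right after it ("super" or "Sioux per").
        u_final = chunks[0].lower() == "sioux"
        er_final = any(c.lower() in ("super", "sioux per") for c in chunks)
        print(f"  {target:18s} [u] final: {u_final}   [ɚ] final: {er_final}")
```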
Six subjects (four of the same subjects as in experiments 1 and 2 and two additional subjects) read five tokens each of the sentences at three rates. We measured the durations of the [u] and [ɚ] in the target phrases, superstition, super station, and Sioux perspective. The measurements showed two different patterns, depending on the subject. Two subjects had a longer [u] in Sioux perspective and a longer [ɚ] in super station, but never any apparent lengthening of the [ɚ] in superstition, regardless of accent placement. Figure 9.9 shows such results for subject JSC, for whom these differences were significant only at the slow rate. Similar results were observed for
Figure 9.9 Mean durations of the vowel [u] in the first syllable (top) and the vowel [ɚ] in the second syllable (bottom) of the target sequences in the experiment 3 sentences produced by subject JSC. Means averaged over 5 tokens at each rate for each intonation pattern.
subject KDG. These results do not support either the accentual phrase or the stress foot as the domain of word-final lengthening. Rather, they suggest that final lengthening occurs only at the edges of actual lexical items, bounding a constituent that is independent from the hierarchy of stresses and accents. The remaining four subjects also had a longer [u] in Sioux, but only in the surprise-redundancy contour, where it was accented as well as stressed. Moreover, for these subjects, the [ɚ] in superstition tended to pattern like that in super station; in both words, it was often longer than in Sioux perspective, although the differences were generally not significant. These results are illustrated in figure
Figure 9.10 Mean durations of the vowel [u] in the first syllable (top) and the vowel [ɚ] in the second syllable (bottom) of the target sequences in the experiment 3 sentences produced by subject JRE. Means averaged over 5 tokens at each rate for each intonation pattern.
9.10, for subject JRE. Similar results were observed for subjects LAW, BDM, and EXE. (Subjects LAW and EXE had significant differences in the [ɚ] at normal or slow rates for only the surprise-redundancy contour.) The similarity of superstition to super station and the dependency on accent pattern for any effect support the accentual phrase as the domain of word-final lengthening. Taken alone, the results of experiment 3 suggest that speakers differ in their interpretations of word-final lengthening, and that there are (at least) two possible interpretations of the effect; some speakers use it to mark phonological words and others to mark accentual phrases. However, there are problems with this
conclusion when the results are considered together with those of experiment 2. Recall that all of the subjects in the second experiment had word-final lengthening in all positions, regardless of accent placement, a result that supports some phonological constituent other than the accentual phrase as the domain of word-final lengthening, although it is not clear what the something else is, because the target phrases do not differentiate between the stress foot and the prosodically independent phonological word. But four of the subjects in that experiment were speakers who displayed the second pattern in experiment 3. That is, these four subjects showed word-final lengthening in pop opposed versus poppa posed regardless of the accentual status of the target syllable, but showed no word-final lengthening in Sioux perspective versus super station except when the target syllable bore a pitch accent.
9.7 Focus constituents
In attempting to reconcile these apparently contradictory results, we note a possibly crucial difference between the two experimental corpora - namely, that the word boundaries in Sioux perspective and super station are internal to a noun phrase whereas those in pop opposed and poppa posed coincide with the major syntactic division between the subject and predicate of the sentence. Perhaps, then, the relevant constituent triggering word-final lengthening is not words, but some sort of syntactic phrase, as assumed by earlier researchers such as Klatt (1975) and Cooper and Paccia-Cooper (1980).

There are problems with this interpretation, arising from the fact that we must still differentiate between two classes of subjects in experiment 3. Recall that the close syntactic relationship between Sioux and perspective did not prevent word-final lengthening in Sioux for JSC and KDG. Thus, syntactic constituency must be relevant only for the other four subjects. On the other hand, it was just these four other subjects who showed accent-dependent word-final lengthening in experiment 3. Thus, it cannot be syntactic constituency alone that is relevant here, but rather some consequence of syntactic structure for some other property of utterances, some property that bears on accent placement.

A possible candidate is the property of focus. A consequence of the different syntactic structures involved in the two experiments is that, when the target syllable is in a prenuclear unaccented position in experiment 2, it is not part of the following VP constituent containing the nuclear-accented verb, whereas in the analogous accentual position in experiment 3 it is part of the NP constituent containing the nuclear accent. Selkirk (1984) proposes an account of the prosody-focus relation in which a hierarchical representation of focus structure is built on syntactic constituents with reference to accent placement. In this account, the different syntactic structures in the two experiments would result in the following
focus structures for the sentences with the target syllable in prenuclear unaccented position: (13)
a. Experiment 2                            b. Experiment 3

          NP                                        F(NP)
        /    \                                     /     \
    F(Det)    N                                F(Adj)    F(N)
     Her     pop opposed the marriage.       I have a real super station.
      |                                                 |          |
      H*                                              L*+H       L*+H  L  H%
L
A more serious problem with this explanation is that it relies crucially on several
untested suppositions. Any definite proposal invoking focus constituents must wait until we have as much evidence about the structure of focus constituency and its relationship to accent placement in English as we now do have about prosodic structure and intonation. We know of a few experiments in this area (e.g. Eady et al. 1986; Wells 1986), but these few do not differentiate the relevant phonological structures in a way that would make them useful in determining how prosodic constituency reflects focus structure. We intend to investigate the relationships among focus structure, accent placement, and word-final lengthening in future experiments, but any further speculation at this stage would be premature.
9.8 Conclusion
While our experiments have raised more questions than they have resolved, they do sustain two important conclusions. First, they strongly suggest that there are two different prosodic boundary effects: phrase-final lengthening and word-final lengthening. Phrase-final lengthening occurs at intonational-phrase boundaries, and is a large effect that is highly consistent across speakers and rates. Word-final lengthening occurs at some other constituents' boundaries, and is a much smaller effect that is not consistently evident across speakers and rates. Second, the word-final effect cannot be explained in terms of isochronous metrical intervals, but must refer to a constituent that is delimited by boundaries of some sort. However, because of the small sizes of the differences involved and the possibility of interspeaker differences, it is difficult to determine what that constituent is and how it relates to the prosodic hierarchy of stresses and accents.
Notes We thank John Sidtis for letting us use his analysis facilities at New York Hospital (developed under NIH grants NS 18802-03 and NS 17778-04). We also thank Osamu Fujimura's department at AT&T Bell Labs for lending an AT&T PC 6300 and A/D board to the lab at Ohio, and Joan Miller for letting us use her speech and graphics routines in a waveform editor. This paper is based in part upon work supported by the National Science Foundation under Grants No. IRI-8617852 and IRI-8617873. 1 The alignment of the phrase accent has been somewhat problematic. Pierrehumbert (1980) suggested that it spreads over the syllables after the final accent, thus filling the space between the nuclear accent and the edge of the intonational phrase. However, she rejected a phonological spreading rule because of the tone's realization in phrases where there are no minimal tone-bearing units following the nuclear-accented syllable, which made the mechanism of the alignment unclear because she did not specify an alternative phonetic representation for the spreading. More recently, Beckman and Pierrehumbert (1986) have posited that the phrase accent defines an intermediate level of phrasing between the intonational phrase and whatever lower-level constituent is the domain of the pitch accent, and in Pierrehumbert and Beckman (1988) they suggest further that it is aligned as a right-peripheral
Lengthenings and shortenings and the nature of prosodic constituency (boundary) tone to this intermediate phrase and simultaneously as a right-peripheral tone linked to the nuclear-accented word. When the end of the intermediate phrase is far from the end of the nuclear-accented word, this dual attachment at the two levels accounts for its apparent spreading, without necessitating a new phonetic mechanism. 2 Note, however, that if unaccented words are interpreted as not belonging to any accentual phrase, intonational patterns such as the one shown in (6) would result in a rather unorthodox notion of extra-metricality in which an arbitrary number of stress feet can be extra-metrical rather than the usual single rightmost or leftmost one.
References
Anderson, Mark D., Janet B. Pierrehumbert and Mark Y. Liberman. 1984. Synthesis by rule of English intonation patterns. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 2.8.2-2.8.4.
Beckman, Mary E. 1986. Stress and Non-Stress Accent. Netherlands Phonetic Archives 7. Dordrecht: Foris.
Beckman, Mary E. and Janet B. Pierrehumbert. 1986. Intonational structure in English and Japanese. Phonology Yearbook 3: 255-309.
Cooper, William E. and Jeanne Paccia-Cooper. 1980. Syntax and Speech. Cambridge, MA: Harvard University Press.
Eady, Stephen J., William E. Cooper, Gayle V. Klouda, Pamela R. Mueller, and Dan W. Lotts. 1986. Acoustical characteristics of sentential focus: narrow vs. broad and single vs. dual focus environments. Language and Speech 29: 233-251.
Fowler, Carol Ann. 1977. Timing control in speech production. Ph.D. dissertation, University of Connecticut. (Distributed by the Indiana University Linguistics Club.)
Gee, James Paul and Francois Grosjean. 1983. Performance structures: a psycholinguistic and linguistic appraisal. Cognitive Psychology 15: 411-458.
Hirschberg, Julia and Janet Pierrehumbert. 1986. Intonational structuring of discourse. Proceedings of the 24th Meeting of the Association for Computational Linguistics, 136-144.
Huggins, A. W. F. 1975. On isochrony and syntax. In G. Fant and M. A. A. Tatham (eds.) Auditory Analysis and Perception of Speech. Orlando: Academic Press.
Klatt, D. H. 1975. Vowel lengthening is syntactically determined in connected discourse. Journal of Phonetics 3: 129-140.
Liberman, Mark and Janet Pierrehumbert. 1984. Intonational invariance under changes in pitch range and length. In M. Aronoff and R. T. Oehrle (eds.) Language Sound Structure. Cambridge, MA: MIT Press, 157-233.
Oller, D. Kimbrough. 1973. The effect of position in utterance on speech segment duration in English. Journal of the Acoustical Society of America 54: 1235-1247.
Pierrehumbert, Janet B. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT. (Distributed by the Indiana University Linguistics Club 1988.)
Pierrehumbert, Janet B. and Mary E. Beckman. 1988. Japanese Tone Structure. Linguistic Inquiry Monograph. Cambridge, MA: MIT Press.
Sag, Ivan and Mark Liberman. 1975. The intonational disambiguation of indirect speech acts. Papers from the 11th Regional Meeting of the Chicago Linguistic Society, 487-497.
Selkirk, Elisabeth O. 1980. The role of prosodic categories in English word stress. Linguistic Inquiry 11: 563-605.
Selkirk, Elisabeth O. 1981. On the nature of phonological representation. In J. Anderson, J. Laver, and T. Meyers (eds.) The Cognitive Representation of Speech. Amsterdam: North-Holland.
Selkirk, Elisabeth O. 1984. Phonology and Syntax. Cambridge, MA: MIT Press.
Ward, Gregory and Julia Hirschberg. 1985. Implicating uncertainty: the pragmatics of fall-rise intonation. Language 61: 747-776.
Wells, William H. G. 1986. An experimental approach to the interpretation of focus in spoken English. In Catherine Johns-Lewis (ed.) Intonation in Discourse. London: Croom Helm.
10 On the nature of prosodic constituency: comments on Beckman and Edwards's paper ELISABETH SELKIRK
10.1 Prosodic structure theory
Early work in generative phonology, for example, Chomsky and Halle (1968) and McCawley (1968), made it clear that in order to account for the manner in which rules of phonology apply in environments larger than the word, the sentence must be analyzed into a sequence of domains of different levels. In these works boundary symbols gave notational representation to domains, and there was no necessary coincidence between the boundaries of different domains. The theory that the representation of these phonological domains is instead a hierarchically arranged prosodic constituent structure, as proposed in Selkirk (1978, published in 1981) and developed elsewhere (see references), is based on a number of observations. A first observation is that the limits of higher order (larger) phonological domains systematically coincide with the limits of lower order domains. This leads naturally to a theory of the representation of these domains as a well-formed bracketing or tree, with each instance of a domain of a particular level being a constituent. A tree representation predicts this relation between domains, while boundary theory does not.1 Moreover, it seems to be the case that a higher order constituent Pn immediately dominates only constituents of the next level down, Pn-1. In other words, representations appear to have the general character of that in (1), where PPh, PWd, etc. are simply convenient names for constituents of levels Pi, Pj, etc., in the hierarchy of phonological domains:
(1)  (                                        )  Utt
     (              ) (                       )  IPh
     (      ) (     ) (       ) (     ) (     )  PPh
     (  ) (  ) (    ) (   ) (  ) (  ) (  ) (  )  PWd
     ( )( )( )( )( )( )( )( )( )( )( )( )( )( )  Ft?
A "strictly layered" representation like this is characterizable by the schema in
(If: (2)
Pn -> Pn~1 *
(X* means ' one or more X's')
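Although nothing of the sort appears in the original, the content of schema (2) can be made concrete with a small sketch: the function below checks, over a toy nested-list encoding of a prosodic tree, that every constituent immediately dominates only constituents of the next level down. The level names and the example tree are illustrative assumptions, not material from the chapter.

```python
# Illustration of the strict layer schema (2): every node of level Pn must
# immediately dominate one or more nodes of level Pn-1.

LEVELS = ["Ft", "PWd", "PPh", "IPh", "Utt"]   # lowest to highest

def strictly_layered(node):
    """A node is (label, children); a bare string is a terminal (a syllable).
    Returns True if every constituent dominates only constituents of the
    immediately lower level, as required by schema (2)."""
    if isinstance(node, str):          # terminals are vacuously well formed
        return True
    label, children = node
    if not children:                   # Pn -> Pn-1* requires at least one daughter
        return False
    expected = LEVELS.index(label) - 1
    for child in children:
        if isinstance(child, str):
            if expected >= 0:          # only feet may dominate bare syllables here
                return False
        elif LEVELS.index(child[0]) != expected or not strictly_layered(child):
            return False
    return True

# A toy utterance: one IPh containing two PPhs, each one PWd of one foot.
utt = ("Utt", [("IPh", [("PPh", [("PWd", [("Ft", ["syl", "syl"])])]),
                        ("PPh", [("PWd", [("Ft", ["syl"])])])])])
print(strictly_layered(utt))           # True: the tree obeys (2)
```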
A further observation, of fundamental importance, is that the prosodic constituent structure of a sentence is in no way isomorphic to its syntactic constituency, while bearing some relation to it, and this leads to the view that the prosodic constituency is properly phonological. From the very start, work on the theory of prosodic structure has contended with this central question: what is the nature of the relation between the prosodic structure of a sentence and its syntactic structure? Most researchers agree that there is typically a systematic relation between the two (cf. Selkirk 1978, 1980; Nespor and Vogel 1982, 1986; Booij 1983; Hayes 1984; Selkirk 1984, 1986; Shih 1985; Chen 1987; Hale and Selkirk 1987; and others), though from one language to another it may not have exactly the same character. I believe that some headway is being made towards a general theory of the syntax-prosodic structure relation, and I will report on that below. A second question concerns the extent to which prosodic structure is autonomous of the syntax. It is entirely conceivable that the properties of prosodic structure enumerated above could follow from the properties of the mapping between syntactic and prosodic structure. An alternative view, which I believe is more likely to be correct, is that there are independent principles of well-formedness for prosodic structure, some of which may vary in parameterized fashion from one language to the next, and that these combine with the principles of the syntax-phonology mapping in the assignment of a prosodic structure to a sentence. In Selkirk (1986) and Hale and Selkirk (1987) it is argued that two parameters play a pivotal role in the mapping of syntactic representation into that hierarchy of prosodic domains which forms the essential constituency of phonological representation. The first is the designated category parameter. It incorporates the hypothesis that for each level Pi of the prosodic hierarchy there is a single designated category DCi of syntactic structure, e.g. maximal projection or lexical word, with respect to which phonological representation at level Pi is defined. The second is the end parameter, which embodies the hypothesis that only one end (Right or Left) of the designated category DCi is relevant in the assignment of a prosodic constituent Pi: the sole requirement imposed is that the R/L end of each DCi in syntactic structure coincide with the edges of successive Pi in prosodic structure (see (1)). Together the designated category and end parameters form a composite mapping parameter. The hypothesis is that the set of such composite parameters in the grammar of a language constitutes the syntactic constraints on the prosodic structure assigned to any sentence of the language. There appears to be a role for independent principles of prosodic structure as well. The schema given above in
(2), which says that the prosodic structure of a sentence has a strictly layered tree representation, can be viewed as a constraint on the prosodic structure of a sentence in any language, always operating in concert with the syntactic constraints on prosodic structure. Moreover, individual languages may impose additional phonological constraints on prosodic structure. The role for syntactic constraints on the mapping will be illustrated in section 2 with an example from Japanese, as will the interaction of these with independent prosodic constraints. The basic phonology of tone in Japanese and the rules for the phonetic implementation of tone crucially involve prosodic structure at two levels. The work of McCawley (1968), Poser (1984) and Pierrehumbert and Beckman (1988) has been particularly important in establishing this. The designated categories for these levels, we will see, are lexical word (X°) and maximal projection (Xmax). In Japanese it is the left end of each of these syntactic categories which forms the locus of correspondence to the prosodic structure. In section 3 I will suggest that assuming a prosodic organization into phonological words (PWd) and phonological phrases (PPh) for English, with the locus of correspondence to the syntax being the right edge of (lexical) X° and Xmax,3 respectively, gives a basis for a better understanding of Beckman and Edwards's (this volume) data on the domain(s) of final lengthening in English. Do languages always show an organization into these two levels of prosodic structure - i.e. phonological word and phonological phrase? It is too soon to say. Certainly many languages show no obvious phonological phenomena requiring the positing of prosodic structure at either of these levels, while others show phonological evidence for only one or the other. It is encouraging that phonetics may provide evidence on this important question of phonological representation. Final lengthening, if itself a universal aspect of phonetic implementation, might be just the sort of phenomenon that permits investigating whether phonological words and phrases are universally present in the prosodic structure of sentences.
10.2 Prosodic structure in Japanese
Words in Japanese come basically in two varieties, accented or unaccented. The tonal contours that result from concatenations of accented and unaccented words in the sentence depend (indirectly) on the syntactic relations obtaining between the words. McCawley (1968) showed that an analysis of the tonal properties of Japanese sentences required positing two levels of phonological domain. Poser (1984), Beckman and Pierrehumbert (1986) and Pierrehumbert and Beckman (1988), who each have contributed to developing an explicit phonetic implementation for a phonological representation of Japanese tone, build on the McCawley phrasing analysis. It is within McCawley's major phrase (Pierrehumbert and Beckman's intermediate phrase) that the downstepping (catathesis) of one accented word with respect to another preceding accented word takes place. An
accent will not be downstepped if it is not within the same major phrase as a preceding accent. A minor phrase (Pierrehumbert and Beckman's accentual phrase) has the property that it contains at most one accent and has at its left edge an "initial lowering," a rise in F o from the first to the second mora. A major phrase consists of a sequence of minor phrases, hence an initial lowering will initiate a major phrase as well. In neither McCawley (1968) nor any subsequent published work is there a characterization of the syntax of major phrases. In an unpublished paper, Terada (1986) suggests that the left edge of a maximal projection marks the limits of major phrases. This idea is pursued here. As for the minor phrase, it is generally recognized (Hattori 1949, McCawley 1968; Kohno 1980; Miyara 1980) that it consists minimally of a lexical word (i.e. noun, adjective, verb) and any and all function words appearing to its right. Some progress has also been made in understanding under what syntactic circumstances a minor phrase may consist of more than one lexical word (Kohno 1980; Poser 1984). Koichi Tateishi and I are currently engaged in a program of research whose purpose it is to investigate further the syntactic constraints on major and minor phrases in Japanese. In the following paragraphs I will report on the results of one experiment which strongly suggest that the general approach to the syntax-phonology mapping sketched in section 1 holds true for Japanese. All the sentences investigated in our experiment consist of a sequence of three nouns followed by an unaccented verb. There are two sets - in one, the A set, all the nouns are accented; in the other, the U set, all the nouns are unaccented. The sentences of each set contain the same noun in first, second and third position, forming minimal n-tuples. The A set sentences are: ao'yama-case yama'guchi-case ani'yome-case inai/yonda
The U set sentences contain the nouns below, in that order: oomiya-case inayama-case yuujin-case inai/yonda
Within the A and U sets the sentences vary in syntactic structure, and hence in case marking. The sentence types are the same in the A and U sets; they are shown in figure 10.1. With these test materials we sought to discover what correlations there might be between syntactic structure and the aspects of sentential tone patterns mentioned above, namely Downstep and Initial Lowering. The sentences of the two sets were given to four speakers of Tokyo Japanese, who were asked to pronounce each of them twice at slow, normal and fast tempos. Pitch tracks from the recordings were analyzed (actual pitch tracks for two speakers are given in the appendix). The results were sufficiently consistent across
[Figure 10.1 Syntactic structures for the six sentence types: (i) Noun1-no Noun2-no Noun3-ga Verb (left-branching subject NP); (ii) Noun1-no Noun2-no Noun3-ga Verb (right-branching subject NP); (iii) Noun1-no Noun2-ga Noun3-o Verb (branching subject, object); (iv) Noun1-ni Noun2-no Noun3-ga Verb (locative, branching subject); (v) Noun1-ga Noun2-no Noun3-o Verb (subject, branching object); (vi) Noun1-ga Noun2-ni Noun3-o Verb (subject, indirect object, direct object).]
speakers that they can be represented schematically: figure 10.2 illustrates the patterns in the A set, figure 10.3 illustrates the patterns in the U set. The dotted lines in the figures indicate differences between speakers that will be discussed here. A full report of the results cannot be given at this point. The pitch curves of figure 10.2 show the presence of Initial Lowering at the beginning or left of each noun. The data are thus consistent with the view that the organization of the sentence into phonological words (minor phrases) in Japanese is submitted to the syntactic constraint in (3), which expresses a parameter (composed of designated category and end) of the syntax-prosodic structure mapping. (3)
Japanese phonological word parameter: Xlex[
This should be interpreted as saying that the left edge of a lexical word in syntactic structure necessarily corresponds to the limits of a prosodic constituent, call it PWd (or minor phrase). Since function words are not picked out in the mapping, they are predicted to be in the same PWd as a preceding lexical word or function word.4 This is essentially the analysis proposed for minor phrases by Kohno (1980). What we also see in the set A sentences in figure 10.2 is a pattern of downstepping and lack of downstepping that is very systematic. The generalization is that a lack of downstepping of N2 with respect to N1 or of N3 with respect to N2 occurs when the second of the pair initiates a new syntactic phrase, XP. Given our assumption that the domain of downstep is a phonological constituent, i.e. that it is only within a major phrase that downstep occurs, the generalization can be put in different terms: the left edge of XP is the locus of the limits of a major phrase. This is essentially what Terada (1986) proposes. We can express this as the mapping parameter (4): (4)
Japanese phonological phrase parameter: Xmax[
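Purely as an illustration (nothing like it appears in the chapter), the procedural content of these two left-edge parameters can be sketched as follows; the word annotations and the helper function are my own assumptions. A prosodic boundary is opened at the left edge of every lexical word (minor phrase) and at the left edge of every maximal projection (major phrase).

```python
# Illustrative sketch of the end-parameter mapping in (3) and (4).

def group_by_left_edges(words, edge_key):
    """Open a new phrase whenever a word begins the relevant syntactic edge;
    otherwise attach the word to the phrase already open."""
    phrases = []
    for w in words:
        if w[edge_key] or not phrases:
            phrases.append([])
        phrases[-1].append(w["text"])
    return phrases

# Sentence type (iii) of figure 10.1: [NP N1-no N2]-ga [NP N3]-o Verb.
# Every noun and the verb is a lexical word; N1 and N3 each begin an NP (XP).
sentence = [
    {"text": "ao'yama-no",    "lex_left": True, "xp_left": True},
    {"text": "yama'guchi-ga", "lex_left": True, "xp_left": False},
    {"text": "ani'yome-o",    "lex_left": True, "xp_left": True},
    # The unaccented verb is included for completeness; the text notes that
    # it may in fact be extra-prosodic.
    {"text": "yonda",         "lex_left": True, "xp_left": False},
]

print(group_by_left_edges(sentence, "lex_left"))
# -> one minor phrase (PWd) per lexical word
print(group_by_left_edges(sentence, "xp_left"))
# -> two major phrases, each beginning at the left edge of an XP
```

Setting the end parameter to Right rather than Left, with the appropriate designated categories, would in the same way yield the English phrasings discussed in section 3 (cf. (6)).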
Parameters (3) and (4) together, in combination with the general constraint (2) on the nature of prosodic structure, account for the assignment of prosodic structure to sentences in Japanese that we see exemplified in lines (b) and (c) of the examples in figure 10.2. Downstep and Initial Lowering apply in the expected way to a phonological representation containing such prosodic structure assignments. Something more needs to be said about the type (i) case in figure 10.2. For three out of the four speakers the third noun of the left-branching noun phrase is not downstepped with respect to the immediately preceding noun (cf. Type (i) for NH in the appendix). Does this mean it is in a separate major phrase? Note that here there is no left edge of XP to which this putative major phrase break could correspond. Following Terada (1986), we might speculate that for these speakers
there is an autonomous prosodic condition on major phrases, restricting them to just two minor phrases. Even so the actual value of the third noun is still relatively low with respect to the first peak in the phrase. Indeed it appears to be downstepped with respect to the first noun, if not the second. An alternative interpretation, then, is that (a) all non-initial accents in a major phrase are downstepped with respect to an initial accent, but that (b) downstepping is not invariably cumulative (contra Pierrehumbert and Beckman 1988). In order to investigate this issue further, larger left-branching sequences will have to be examined. There is a final puzzling fact concerning the A set. The sentences of types (iv) and (v) sometimes show a greater tonal prominence of the third noun than in sentences of type (ii), though they are predicted to behave the same. (This pattern is illustrated in the NH sentences in the appendix.) We do not know to what this difference should be attributed. Let us turn now to the sentences of the U set, which contain no accented words. They are displayed here in figure 10.3. The tonal patterning of the U-set sentences looks on the face of it rather different. Since there are no accents, there is no downstepping to be had. What we observe simply are differences in the presence and absence of Initial Lowering before each putative minor phrase. As we will see, this indirectly gives us data about the organization of the sentence into major phrases that turns out to coincide sentence type by sentence type with the distribution of downstepping in set A. One fact is clear from figure 10.3 - that two unaccented nouns may join together in a single minor phrase, unseparated by initial lowering. McCawley and Kohno have taken this as indicating that an accentless minor phrase may be incorporated into another adjacent minor phrase. We will assume this analysis for now (but see note 5). The important question for us is which nouns may join together. The data in figure 10.3 show there to be a systematic correlation between syntactic structure and accentless minor phrase incorporation. Initial Lowering, indicating the initiation of a minor phrase, coincides with the left edge of an XP. It might be hypothesized that the presence of a major phrase boundary at this XP edge forces the presence of a minor phrase boundary, and hence the Initial Lowering. We could understand this as falling out from the general constraint on prosodic structure expressed in (2). Accentless minor phrase incorporation, on this story, would be restricted to the confines of the higher order division into major phrases, a "top-down" effect.5 In line (b) of the examples in figure 10.3 is the major phrasing predicted by the phrasing parameter (4). In line (c) is the minor phrase organization evidenced by the presence of Initial Lowering. It will have been produced through the operation of incorporation, operating on the original structure assigned in virtue of constraints (3), (4) and (2). Two additional facts require comment. First, speakers varied in whether they showed initial lowering before the third noun in the left-branching type (i)
[Figure 10.2 The A set: accented sentences, Ao'yama-case Yama'guchi-case ani'yome-case inai/yonda (Aoyama, Yamaguchi, wife of elder brother, not exist/called). Each panel (i)-(vi) gives the sentence in line (a), its major phrasing (MaP) in line (b), and its minor phrasing (MiP) in line (c). The sentences and glosses: (i) Ao'yama-no Yama'guchi-no ani'yome-ga inai "We cannot find the sister-in-law of Mr. Yamaguchi from Aoyama."; (ii) Ao'yama-no Yama'guchi-no ani'yome-ga inai "We cannot find Mr. Yamaguchi's sister-in-law from Aoyama."; (iii) Ao'yama-no Yama'guchi-ga ani'yome-o yonda "Mr. Yamaguchi from Aoyama called his sister-in-law."; (iv) Ao'yama-ni Yama'guchi-no ani'yome-ga inai "Mr. Yamaguchi's sister-in-law is not in Aoyama."; (v) Ao'yama-ga Yama'guchi-no ani'yome-o yonda "Mr. Aoyama called Mr. Yamaguchi's sister-in-law."; (vi) Ao'yama-ga Yama'guchi-ni ani'yome-o yonda "Mr. Aoyama called his sister-in-law to Yamaguchi."]
[Figure 10.3 The U set: accentless sentences, Oomiya-case Inayama-case yuujin-case inai/yonda (Oomiya, Inayama, friend, not exist/called). Panels (i)-(vi) parallel figure 10.2, giving each sentence (a), its major phrasing (MaP) (b), and its minor phrasing (MiP) (c): (i) Oomiya-no Inayama-no yuujin-ga inai "We cannot find the friend of Mr. Inayama from Oomiya."; (ii) Oomiya-no Inayama-no yuujin-ga inai "We cannot find Mr. Inayama's friend from Oomiya."; (iii) Oomiya-no Inayama-ga yuujin-o yonda "Mr. Inayama from Oomiya called his friend."; (iv) Oomiya-ni Inayama-no yuujin-ga inai "Mr. Inayama's friend is not in Oomiya."; (v) Oomiya-ga Inayama-no yuujin-o yonda "Mr. Oomiya called Mr. Inayama's friend."; (vi) Oomiya-ga Inayama-ni yuujin-o yonda "Mr. Oomiya called his friend to Inayama."]
sentence. All speakers showed initial lowering here at a slow rate of pronunciation. At the fast tempo two had Initial Lowering at this location, and two did not. The presence of a minor phrase break here might be additional evidence that the prosodic constituency of phrases is maximally binary, under certain conditions at least. Perhaps unaccented minor phrase incorporation is possible up to binarity for all speakers at slow rates, while two speakers retain the binarity restriction at all rates. The second fact concerns the inter-speaker variation shown in the initial lowering of the third noun in type (iii) and (vi) sentences, where each noun constitutes a separate noun phrase and hence should initiate a separate major phrase. Two speakers showed a lack of initial lowering here in normal and fast tempos, suggesting that the final minor phrase (itself a major phrase) is incorporated into the preceding one, even despite the intervening major phrase boundary. (Compare NH and KT on type (iii) sentences in the appendix.) The circumstances under which a major phrase boundary can be ignored and the motivation for the binarity restriction on prosodic structure require further investigation. Finally, a comment on the status of the unaccented verb in these sentences is in order. The verb appears to make no contribution to the tonal patterning of the sentences investigated. The generalizations stated above make no reference to its presence. Perhaps such verbs are "extra-prosodic." Only a more complete examination will allow us to say more. To summarize, our examination of the tonal patterns of a controlled set of Japanese sentence types has led to the conclusion that the assignment of prosodic structure to Japanese sentences is systematically constrained by (a) syntactic factors, along the lines suggested in Selkirk (1986) and Hale and Selkirk (1987), and (b) prosodic factors, both universal (as expressed in (2), the "strict layer constraint") and Japanese-particular (the apparent binarity constraints at play for some speakers). Anticipating now the discussion of final lengthening to follow, it would be of great interest to see whether the ends of major and minor phrases, motivated in the analysis of tonal patterning, are the locus of any sort of constituent-final lengthening. It would be especially interesting to see whether the phrases found internal to the left-branching noun phrase in the type (i) sentences also show final lengthening, for these phrase limits are presumably not imposed by the syntax, but rather by the prosodic structure constraints of Japanese. Were these phrase-ends treated in a way parallel to the "syntactically-based" phrase ends, then this would be strong evidence that final lengthening is established on the basis of a prosodic constituency, rather than a syntactic one.
10.3 Prosodic structure and final lengthening in English
The Beckman and Edwards experiments on final lengthening reported in their article in the present volume were designed to investigate the truth of three propositions: (i) that there exists a constituent-final lengthening effect in English, (ii) that this final lengthening is established with respect to a prosodic structure, rather than a syntactic structure, and (iii) that the prosodic structure of English sentences is assigned in virtue of the prominence patterns of sentences. The results of the experiments firmly support the first proposition, adding substantially to the body of evidence already in favor of it (e.g. Cooper and Paccia-Cooper 1980, and the references cited there). The experiments do not lead to conclusions of comparable firmness about propositions (ii) and (iii), though they offer data that are valuable in sorting out just what factors do affect the differential application of final lengthening. There are three possible candidates for the constituent structure which serves as the domain of final lengthening:
I. Surface syntactic structure (e.g. Lehiste 1973a, 1973b; Klatt 1975, 1976; Cooper and Paccia-Cooper 1980; Selkirk 1984),6
II. A prosodic structure submitted to syntactic constraints, and perhaps also prosodic constraints (see sections 1 and 2 above), and
III. A prosodic structure assigned in virtue of the prominence pattern of the sentence (Beckman and Edwards).
I believe the evidence comes out more strongly in favor of either of the first two than it does in favor of the theory of the final lengthening domain that Beckman and Edwards espouse. Beckman and Edwards propose - for English - that above the level of the syllable there are at most three levels of prosodic structure present in any representation: the foot (a grouping of syllables; includes one stressed syllable), the accentual phrase (a grouping of feet; includes one pitch-accented element), the intonational phrase (a grouping of accentual phrases; one of the pitch accents within the intonational phrase is the nuclear accent). The idea is that, corresponding to each instance of a prominence of a particular sort (stress, accent, nuclear accent), there is a constituent at the relevant level in the prosodic structure of the sentence. Thus, a sentence with one pitch accent will contain just one accentual phrase; a sentence with three stresses will contain three feet, etc. This might be called a "prominence-based" theory of prosodic structure. Nothing is said about where the limits between two constituents are located; indeed, syntax
is given no role in the assignment of a prosodic structure to a sentence. In the example from Beckman and Edwards given below, prosodic constituent limits coincide with the limits of syntactic words (the bracketing for feet and syllables is omitted):
[Schematic example: a metrical grid ("phonological structure") over the sentence, bracketed into accentual phrases, each associated with a pitch accent (H*), within a single intonational phrase; the final H*, followed by the phrase and boundary tones L L%, is the nuclear accent.]
If only one of the words were to bear a pitch accent then Beckman and Edwards assume there would be only one accentual phrase assigned. They remain noncommittal on where its edges might fall (cf. their discussion of their examples (5)-(8)), however, since they do not assume the general strict layering constraint (2) on prosodic structure, according to which a sentence is exhaustively parsed into prosodic categories at the various levels. (2) would require that in a sentence with a single accentual phrase (were such to exist) the limits of that phrase would coincide with the limits of the sentence. According to this "prominence-based" theory of prosodic structure, a sentence with the same syntax, but different numbers and locations of accent, will have different prosodic structure representations. Beckman and Edwards predict this should be reflected in the facts of final lengthening. Their second experiment was designed to test this prediction. The results were negative. Varying the locations of accent in the sentences in (5) (Beckman and Edwards's (5)-(8)) gave rise to no change in the pattern of final lengthening: (5)
a. HER pop opposed the marriage/HER poppa posed the problem
b. HER pop OPPOSED the marriage/HER poppa POSED the problem
c. her POP opposed the marriage/her POPPA posed the problem
The constant presence of the final lengthening effect attested at the end of the subject noun phrase in the Pop opposed/Poppa posed pairs in (5) indicates that the effect is independent of the putative accentual phrasing. Indeed neither this nor the other two experiments give any evidence from final lengthening for the existence of an accentual phrase as conceived of by Beckman and Edwards. What constituent could be providing the domain for the final lengthening exhibited in the sentences in experiment 2? It is not the intonational phrase (though as Beckman and Edwards show in experiment 1 the IPh may constitute a domain of final lengthening - see below). The sentences of (5) each correspond to a single IPh - there is no IPh break at the locus of final lengthening. For the
Beckman and Edwards theory of prosodic structure, there is only one remaining candidate for the final lengthening domain here - the foot. (Their experiment 3 is designed to investigate this possibility.) For the alternative theories I and II, there are yet further possibilities. For the syntactic theory of final lengthening it could be (syntactic) word-end and/or (syntactic) phrase-end that is providing the context for final lengthening in experiment 2. For theory II, according to which the prosodic structure assigned to a sentence is constrained by syntactic factors, it could be prosodic word and/or phonological phrase that is providing the domain for the final lengthening. Suppose there were phonological words in English, constrained by the end parameter ]Xlex. They would be assigned as in (6b) to the sentence structure (6a) shared by all the sentences of (5). Were English in addition, or instead, to have an organization into phonological phrases, based on the end parameter ]Xmax, then the sentence structure of (6a) would have the phrasing of (6c). (6)
a. [ [ [Det] [Noun] ]NP  [ [Verb] [ [Det] [Noun] ]NP ]VP ]S
b. (                )    (      ) (                 )       PWd
c. (                )    (                          )       PPh
Since both have a boundary after the subject, either of these prosodic constituents, or both, could be the domain(s) for final lengthening in experiment 2. What is needed is an investigation that will tease apart the predictions of these various theories. Beckman and Edwards's experiment 3 makes a step in the right direction. The test sentences in experiment 3 included the noun phrases superstition, super station and Sioux perspective, which share the same pattern of stress (organization into "stress feet"), but differ in the location of word ends. The noun phrases were pronounced under three different accentual conditions. Any differences in lengthening amongst them would have to be attributable either to their differing organization into words (be it syntactic or prosodic) or to their differing accentual properties. Beckman and Edwards report that two subjects showed lengthening of the [u] in Sioux perspective and lengthening of the [ɚ] in super station (and nowhere else), in all accentual conditions. Their conclusion, which seems right, is that "These results do not support either the accentual phrase or the stress foot as the domain of word-final lengthening. Rather, they suggest that final lengthening occurs only at the edges of actual lexical items, bounding a constituent that is independent from the hierarchy of stresses and accents." The results from the remaining speakers in experiment 3 allow for a similar interpretation. Four speakers failed to show any lengthening effect at all internal to these noun phrases, except that Sioux was lengthened when it bore an accent. In other words, except for the accented Sioux case, the durational pattern of these noun phrases was for all intents and purposes the same. The absence of any word-final effect in these environments is of interest, for these speakers all showed
constituent-final lengthening in experiment 2, at the end of the subject noun phrase. It seems quite plausible to entertain the hypothesis that these four speakers make a difference between word-final and phrase-final positions in their final lengthening, and that for them the word-final effect is exceedingly small, or nonexistent. A distinction between word-sized constituents and phrase-sized constituents can be made in a syntactically-constrained prosodic structure, as it is, of course, in the syntactic structure itself. Thus both these types of structure are better candidates for the domain of final lengthening than the Beckman and Edwards type of prosodic structure, based on levels of stress and accent. The lengthening of accented Sioux by these last four speakers remains to be explained. It does not seem plausible to see in this fact support for the Beckman and Edwards accentual phrase, given that experiment 2 and the other results from experiment 3 show there to be no role for such an entity. Perhaps what we are seeing here is an effect that was otherwise obscured - simply that accented syllables lengthen in word-final position to a greater degree than unaccented ones. In any case, this fact does not in itself point to any particular theory of the domain of final lengthening. In sum, experiment 3 supports both word-end and phrase-end as domains for final lengthening. The behavior of the first two speakers points to the need for word-end as domain, while the explanation for the different final lengthening behavior of the other four speakers in experiments 2 and 3 supports making a distinction between word-end and phrase-end. Only two of the theories of the domain of final lengthening can accommodate these data: the syntactic structure theory or the syntactically constrained prosodic structure theory. Finally, let us consider Beckman and Edwards's experiment 1, which involved the sentences in (7): (7)
a. (i)  Pop, opposing the question strongly,...
   (ii) Poppa, posing the question strongly,...
b. (i)  Pop opposed the question strongly,...
   (ii) Poppa posed the question strongly,...
In the (a) sentences the subject noun phrases constitute an Intonational Phrase, separated off from the following Intonational Phrase constituted by the parenthetical expression. On the other hand, the subject noun phrases belong to the same IPh as the following words in the (b) sentences. There is a lengthening effect after the subject noun phrase in both the (a) and (b) sentences. It is smaller in (b) than in (a), where an intonational phrase-end coincides with the noun phrase end. The conclusion drawn by Beckman and Edwards, one with which I would certainly concur, is that the intonational phrase-end is also a domain for final lengthening. Does this intonational phrase-end lengthening choose in favor of any general theory of the domains of final lengthening? It is of course consistent with a
prosodic structure approach. But is a syntactic structure approach entirely excluded ? Certainly the intonational phrase is not itself a syntactic constituent in any typically understood sense of the term. The one property that makes it seem potentially more syntactic-y than other phrases of prosodic structure is that it seems to be more closely tied in to the semantic properties of sentences, and, in traditional parlance, to constitute something like a "sense-unit." In earlier work (Selkirk 1984), I myself have referred to the intonational phrase as an "honorary constituent of the syntax," and have been reluctant to accept the existence of a prosodic structure in general solely on the basis of the existence of intonational phrases. In summary, then, it seems that the data on final lengthening that Beckman and Edwards present are consistent with two different theories of the domain of final lengthening - syntactic structure, or syntactically constrained prosodic structure. Further work must now be done to sort out the differing predictions made by these two theories: the differences are bound to be subtle. Whether one can isolate differences or not may depend on whether the prosodic structure of English is in any way autonomous of the syntactic constraints involved in its assignment, as I have suggested may be the case in Japanese.
10.4 Postscript: a grammatical representation of constituent-sensitive timing?
The preceding discussion has centered on the question of the domain of final lengthening. This question is actually independent of what might be called the mechanism of final lengthening, and it is to this issue that I would like to address a few final remarks. Most work on final lengthening has seen it as an "effect," produced in the phonetic implementation of a sentence on the basis of some properties of the linguistic representation of the sentence, above all some aspect of its hierarchical structure (e.g. Lehiste 1973a, 1973b; Klatt 1975, 1976; Cooper and Paccia-Cooper 1980). In earlier work (Selkirk 1984, 1985) I propose instead that there is a "grammaticization" of constituent-sensitive timing. The hypothesis offered is that final lengthening is produced not by direct reference to a syntactic or prosodic structure, but rather by reference to a linguistic representation which I refer to (Selkirk 1985) as a "virtual pause" structure. This virtual pause structure, itself established with respect to a hierarchical constituency of the sentence (be it syntactic or prosodic), is claimed to be realized as final lengthening and/or pausing. As an entity of the representation, a virtual pause may have a role in other aspects of phonetic implementation as well. Two arguments were presented in Selkirk (1984) for a linguistic representation of virtual pauses. The first was based on the possibility it affords of making sense out of the relation between final lengthening and pausing, two constituent-
sensitive durational phenomena which seem not to be independent. The second is based on the idea that patterns of tempo-based variation in the applicability of certain rules of external sandhi might be better understood if these rules were seen as rules of phonetic implementation with temporal adjacency requirements on the segments involved, that temporal adjacency being characterized in terms of a virtual (not real) pause structure. I will not repeat these arguments here. They are laid out in Selkirk (1984: chapter 6) and still call out for experimental verification. In Selkirk 1984 the idea of virtual pause structure was articulated as follows. Silent demibeats, units at the lowest level of the metrical grid not associated with any syllable (cf. Liberman 1975), were assigned by rules to the ends of syntactic constituents. These silent grid positions were translated into a virtual pause structure, whose precise character was said to be determined by tempo and other rhythmic factors as well. It seems necessary to reject the idea that silent demibeats have this sort of role in the grammar. Certainly, if silent demibeats are the timing equivalent of moras (Prince 1983), then the pauses and lengthenings predicted by the presence of multiple silent demibeats are much too large. The alternative is to see virtual pauses as being established directly with respect to a constituency, whether syntactic or prosodic. This alternative retains the idea of Selkirk (1984) that there is a grammatical representation of "virtual pause," an abstract representation of a durational scheme, and that final lengthening, and pausing, are but realizations of that scheme. It is this idea that still needs to be addressed in investigations of constituent-sensitive durational phenomena.
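For concreteness, the following is a rough sketch, entirely of my own devising, of what a grammatical "virtual pause" representation might look like: each constituent contributes an abstract pause value at its right edge, and the resulting schedule would only later be cashed out as lengthening and/or pausing by tempo-sensitive phonetic implementation. The tree encoding, the weights, and the example are invented for illustration and are not Selkirk's proposal.

```python
# Hypothetical sketch of a "virtual pause" structure: each word-end carries an
# abstract pause value reflecting how many (and how large) constituents end
# there; realization as final lengthening and/or pausing is left to later,
# tempo-dependent phonetic implementation.

def virtual_pauses(tree, weights, out=None):
    """tree: (label, [children]) with strings as words.
    Returns [word, pause_value] pairs in linear order; a word's pause value is
    the sum of the weights of all constituents whose right edge it closes."""
    if out is None:
        out = []
    label, children = tree
    for child in children:
        if isinstance(child, str):
            out.append([child, 0.0])
        else:
            virtual_pauses(child, weights, out)
    out[-1][1] += weights.get(label, 0.0)   # this constituent ends at its last word
    return out

# Invented weights: bigger constituents contribute bigger virtual pauses.
weights = {"PWd": 1.0, "PPh": 2.0, "IPh": 4.0}

sentence = ("IPh", [("PPh", [("PWd", ["her", "poppa"])]),
                    ("PPh", [("PWd", ["posed"]), ("PWd", ["the", "problem"])])])

print(virtual_pauses(sentence, weights))
# [['her', 0.0], ['poppa', 3.0], ['posed', 1.0], ['the', 0.0], ['problem', 7.0]]
```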
Appendix
[Figure 10.4 Fo contours for speakers NH and KT for the accented sentences (cf. figure 10.2), in six panels: (i) Left-branching subject; (ii) Right-branching subject; (iii) Branching subject, object; (iv) Locative, branching subject; (v) Subject, branching object; (vi) Subject, indirect object, direct object. See figure 10.1 for the syntactic structures. NH exhibits the binarity constraint; KT does not.]
[Figure 10.5 Fo contours for NH and KT for the unaccented sentences (cf. figure 10.3), in the same six panels.]
Notes
1 Further inadequacies of the boundary theory of domains are enumerated in McCawley (1968), Rotenberg (1978), Selkirk (1978, 1980).
2 Hyman, Katamba and Walusimbi (1987) present a case from Luganda which is incompatible with the view that the representation of domains is a well-formed tree. It is unfortunately not possible for me to evaluate the case in the present context. It will be seen below that evidence from Japanese supports the tree characterization of domains.
3 It doesn't seem unlikely that the choice of end - Left vs. Right - is fixed once and for all for a particular language. That is, it is probable that the relevant end for word will be the end for phrase in language X.
4 Actually, for the only analysis possible here to be one in which function words group with a preceding lexical word it is necessary to assume that the prosodic structure assigned to a sentence is the minimal structure consistent with the syntactic and prosodic constraints that are imposed universally, or by the grammar of the particular language. Otherwise a function word following a lexical word in Japanese might just happen to form a PWd on its own: the parameter (3) only says where a PWd limit must be located, not where it must not be located. Moreover, I should say that the examples of sets A and U probably do not illustrate the grouping of function words with preceding words. For such examples, see McCawley (1968) and Kohno (1980). The case particles of Japanese seen in the A and U sets are quite likely simply nominal suffixes.
5 An alternative approach to the minor phrasing of the U set, which does not employ incorporation, would be to see the initial mapping parameter as requiring that a phonological word (minor phrase) edge coincide only with the left edge of accented lexical words. With this approach as well, the major phrasing at the XP level would impose, top down, the minor phrase boundaries at the left edges of MaP that are not required by the word parameter itself. The only other place that a minor phrase boundary would occur would be at the left edge of an accented word, where this alternative mapping would require it.
6 For Selkirk (1984) the final lengthening effect is mediated by a representation of "virtual pause," derived from "silent" demi-beats of a metrical grid. (See section 4 below for further discussion of a linguistic representation of this timing effect.) The other researchers mentioned conceive of the syntax as being directly accessed in speech production by the process giving rise to final lengthening.
References
Aronoff, M. and M.-L. E. Kean (eds.) 1980. Juncture. Saratoga, CA: Anma Libri.
Beckman, M. and J. Pierrehumbert. 1986. Intonational structure in Japanese and English. Phonology Yearbook 3.
Booij, G. 1985. Principles and parameters in prosodic phonology. Linguistics 21: 275-289.
Chen, M. 1987. The syntax of phonology: Xiamen tone sandhi. Phonology Yearbook 4.
Cooper, W., and J. Paccia-Cooper. 1980. Syntax and Speech. Cambridge, MA: Harvard University Press.
Fretheim, T. E. 1981. Nordic Prosody Vol. 2. Trondheim: Tapir.
Hale, K., and E. Selkirk. 1987. Government and tonal phrasing in Papago. Phonology Yearbook 4.
Hayes, B. 1984. The prosodic hierarchy in meter. In P. Kiparsky and G. Youmans (eds.) (forthcoming).
Hattori, Shiro. 1949. "Bunsetu" to akusento. (Originally published in two parts in Minzoku to hoogen, nos. 3-4. Reprinted in Hattori, Shiro. 1960. Gengogaku no hoohoo. Tokyo: Iwanami Shoten.)
Hyman, L., F. Katamba, and L. Walusimbi. 1987. Luganda and the strict layer hypothesis. Phonology Yearbook 4.
Kiparsky, P. and G. Youmans (eds.) (forthcoming). Perspectives on Meter. New York: Academic Press.
Klatt, D. 1975. Vowel lengthening is syntactically determined in a connected discourse. Journal of Phonetics 3: 129-140.
Klatt, D. 1976. Linguistic uses of segmental duration in English: acoustic and perceptual evidence. Journal of the Acoustical Society of America 59: 1208-1221.
Kohno, T. 1980. On Japanese phonological phrases. Descriptive and Applied Linguistics 13. Tokyo: ICU.
Lehiste, I. 1973a. Rhythmic units and syntactic units in production and perception. Journal of the Acoustical Society of America 54.
Lehiste, I. 1973b. Phonetic disambiguation of syntactic ambiguity. Glossa 7: 107-122.
McCawley, J. 1968. The Phonological Component of a Grammar of Japanese. The Hague: Mouton.
Miyara, S. 1980. Phonological phrase and phonological reduction. PIJL 7: 79-121.
Nespor, M., and I. Vogel. 1982. Prosodic domains of external sandhi rules. In H. van der Hulst and N. Smith (eds.).
Nespor, M., and I. Vogel. 1986. Prosodic Phonology. Dordrecht: Foris.
Pierrehumbert, J. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT.
Pierrehumbert, J., and M. Beckman. 1988. Japanese Tone Structure. Linguistic Inquiry Monograph. Cambridge, MA: MIT Press.
Poser, W. 1984. The phonetics and phonology of tone and intonation in Japanese. Ph.D. dissertation, MIT.
Rotenberg, J. 1978. The syntax of phonology. Ph.D. dissertation, MIT.
Selkirk, E. 1978. On prosodic structure and its relation to syntactic structure. In T. Fretheim (ed.).
Selkirk, E. 1980a. Prosodic domains in phonology: Sanskrit revisited. In M. Aronoff and M.-L. Kean (eds.).
Selkirk, E. 1980b. The role of prosodic categories in English word stress. Linguistic Inquiry 11(3): 563-605.
Selkirk, E. 1984. Phonology and Syntax: The Relation Between Sound and Structure. Cambridge, MA: MIT Press.
Selkirk, E. 1985. Juncture as a rhythmic phenomenon. Paper presented at the 4th International Phonologietagung, Eisenstadt, 1984.
Selkirk, E. 1986. Derived domains in sentence phonology. Phonology Yearbook 3.
Shih, Ch-L. 1985. The prosodic domain of tone sandhi in Chinese. Ph.D. dissertation, UCSD.
Terada, M. 1986. Minor phrasing in Japanese. MS, University of Massachusetts, Amherst.
van der Hulst, H., and N. E. Smith. 1982. The Structure of Phonological Representations (Part I). Dordrecht: Foris.
11 Lengthenings and the nature of prosodic constituency: comments on Beckman and Edwards's paper ELISABETH SELKIRK
Introduction As Beckman and Edwards point out, independent lines of research identify two apparently distinct durational phenomena in speech utterances. Pre-boundary lengthening occurs at the far edge of a domain that, until recently, was identified as syntactic (e.g. Klatt 1975; Cooper and Paccia-Cooper 1980; but see Gee and Grosjean 1983). Stress-timed shortening (as Beckman and Edwards call it) shortens a syllable when there are many of them in a stress foot as compared to when there are few (e.g. Pike 1947/1968). This effect is generally measured by comparing the duration of the stressed syllable in a foot as a function of the total number of syllables in the foot (e.g. Lehiste 1973). Conventional accounts of this phenomenon ascribe it to a rhythmical constraint by which some languages attempt to maintain isochrony between stressed syllables (Abercrombie 1967).l Possibly these two classes of effect are as independent and unrelated as their treatments in the literature suggest. One is described as lengthening, the other as shortening; one occurs at the edge of a domain, the other seems to have to do with prominence peaks in a domain more than with domain edges; by some accounts the domains for the one are syntactic, while those for the other are prosodic or metrical. Alternatively, however, these differences may be more apparent than real. Stressed syllables in monosyllabic stress feet that are identified as unshortened in a stress-timing account can just as well be identified as lengthened; so the absence of stress-timed shortening can instead be described as a lengthening of a stressed syllable at the right edge of a foot. As such, it may count as a sort of preboundary lengthening. Moreover, if Gee and Grosjean (1983) are correct, domains for pre-boundary lengthening are metrical rather than immediately syntactic. If so, the kinds of domains at the edges of which pre-boundary lengthening occurs are the same as the kind in which stress-timed shortening is observed (i.e. both are prosodic). Accordingly, the two effects are possibly of the same general sort, but they occur at different levels of a hierarchy of prosodic domains. Their separate
treatments in the literature may be accidents of their different investigatory methodologies and personnel. Beckman and Edwards have, in my opinion, identified an intriguing puzzle in the literature on durational variation in speech - one that my colleagues and I have also just begun to address (Rakerd, Sennett and Fowler, 1987). Why would languages allow distinct timing effects that, at least hypothetically, could work against each other (in the sense that one may shorten segments that the other lengthens)? Perhaps they don't. Beckman and Edwards describe three experiments that they designed, they suggest, to distinguish the two timing effects and to locate them in a prosodic hierarchy. My first set of comments concerns the research.
Some comments on the research
11.1 Distinguishing between pre-boundary lengthening and stress-timed shortening
My assessment of the research designs adopted by Beckman and Edwards is that they do not permit them either to distinguish stress-timed shortening from pre-boundary lengthening or to determine how stress-timed shortening fits, if at all, into the set of prosodic tiers they examine. The reason is that their crucial comparisons invariably include target stressed syllables followed by exactly one unstressed syllable (in experiments 1 and 2: "Pop opposed" versus "Poppa posed"; in experiment 3, "super" versus "Sioux perspective"). By most accounts, in these stimulus materials, the target stressed syllable invariably occurs in a disyllabic stress foot.2 With stress feet always the same size, stress-timed shortening has not been manipulated - that is, it should be the same everywhere - and so it cannot be observed. In places, Beckman and Edwards make clear that this was an intended feature of their design: "Different verbs followed the target words in order to maintain a constant inter-stress interval" (p. 159); "the corpus was designed expressly always to have exactly one unstressed syllable between the stress on the target syllable on the noun and the stress on the following verb" (p. 162). In that case, however, they should not characterize their research aim as an attempt to "distinguish these effects [pre-boundary lengthening and stress-timed shortening] and to locate them at some level of the prosodic hierarchy" (p. 153), because they have no way to determine whether, in addition to the durational variation they do see, they would also have seen any stress-timed shortening. Nor, of course, can they locate an effect that they cannot observe in a prosodic hierarchy. To ask whether there are distinct effects of pre-boundary lengthening and stress-timed shortening, it is necessary to design the stimulus materials so that each effect has a chance to manifest itself in a way that can be identified. Rakerd, Sennett and Fowler (forthcoming) have done that in a preliminary way by
looking at mono- and disyllabic stress feet in sentences in which the target stressed syllable of the stress foot is followed either by a word boundary alone or by a word boundary that coincides with a syntactic boundary (a noun phrase/verb phrase [NP/VP] boundary).3 Pre-boundary lengthening can be assessed by comparing durations of stressed syllables in a monosyllabic (or disyllabic) stress foot that are followed by the two different boundary types. Stress-timed shortening can be assessed by comparing the durations of target stressed syllables in a monosyllabic versus a disyllabic stress foot both followed by a word boundary (or both by a NP/VP boundary). Following earlier findings by Klatt (1975) and Cooper and Paccia-Cooper (1980), we found that target words before a NP-VP boundary were longer (by about 40 ms.) than words within a syntactic phrase. In addition, target stressed syllables were 7-8 ms. shorter in disyllabic as compared to monosyllabic stress feet. Both effects were statistically significant in two experiments, one including a manipulation of speech rate and the other of sentence length. Remarkably, in both experiments, the effects of the boundary and of stressfoot size were independent. That is, whether or not a target stressed syllable had been lengthened at an NP boundary, it was shortened by 7-8 ms. by a following unstressed syllable. The occurrence of lengthening at the boundary did not disrupt or even diminish the shortening effect.4 We had expected to find stress-timed shortening eliminated across any boundary at which lengthening occurred. That this did not happen suggests that shortening and lengthening effects are distinct. Indeed, they are distinct enough that they may not even be hierarchically related (that is, they are not, apparently, nested); the intervals over which shortening occurs may span a boundary at which lengthening occurs.
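The comparisons just described amount to a 2 x 2 design (boundary type by stress-foot size), and their logic can be made explicit with a small sketch. The numbers below are not Rakerd, Sennett and Fowler's measurements; they are invented placeholders chosen only to mirror the approximate effect sizes reported in the text (about 40 ms and 7-8 ms).

```python
# Purely illustrative cell means (ms) for the target stressed syllable in a
# 2 x 2 design: boundary type x stress-foot size.  The values are made up.
means = {
    ("word",  "monosyllabic"): 200, ("word",  "disyllabic"): 193,
    ("NP/VP", "monosyllabic"): 240, ("NP/VP", "disyllabic"): 233,
}

# Pre-boundary lengthening: same foot size, different boundary types.
lengthening = means[("NP/VP", "monosyllabic")] - means[("word", "monosyllabic")]

# Stress-timed shortening: same boundary type, different foot sizes.
shortening = means[("NP/VP", "monosyllabic")] - means[("NP/VP", "disyllabic")]

# Independence (additivity) check: the shortening should be of the same
# magnitude whether or not boundary lengthening has applied.
shortening_at_word = means[("word", "monosyllabic")] - means[("word", "disyllabic")]

print(lengthening, shortening, shortening_at_word)   # 40 7 7
```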
11.2 Making the most informative comparisons
Although, as I have indicated, I do not think that Beckman and Edwards have achieved their stated aims, their research does help to uncover the kinds of domains that exhibit pre-boundary lengthening at their edges. In this regard, however, I have two complaints, not about the research itself, but about its presentation. I am not certain that the authors focus on the most informative comparisons that their data offer, and I wish that they had not avoided setting their research in the context of previous work. As to the first complaint, in experiments 1 and 2, for example, Beckman and Edwards's figures plot durations of /a/ in "Pop" and in "Poppa" in sentences that are matched in the vicinity of the target words in their intonational or accentual phrasing. These figures reveal word-final lengthening where it occurred in the various sentences of the experiments. However, I am not sure why this is a particularly interesting effect to examine. Findings in these experiments and elsewhere (Oller 1973; Klatt 1975) suggest to me that word-final lengthening is
fairly ubiquitous. In any case, the question whether pre-boundary lengthening occurs differentially at the ends of intonational phrases requires a different comparison - one that holds effects of word-final lengthening constant and asks whether there is additional lengthening at the edges of larger constituents. The possibility of additional lengthening would be revealed most clearly in figures and analyses that compared the duration of /a/ in "Pop" (or in "Poppa") at the end of an intonational or accentual phrase against its duration within the phrase. Beckman and Edwards's experiments are designed to permit these comparisons, but surprisingly (at least from my perspective), they are not made directly. In experiment 1, they find significant final lengthening of the same vowel when it is intonational-phrase internal. They remark that the lengthening effect appears smaller and less consistent when it is phrase internal. Possibly, then, the former lengthening effect mixes word-final lengthening with additional lengthening at the intonational phrase boundary. As for my second complaint, I am curious how the findings of the experiments reported by Beckman and Edwards fit into the context of other related findings already in the literature. In two places in their paper, they appear to dismiss at least some of this earlier work on either pre-boundary lengthening or stress-timed shortening. (Perhaps to be tactful, they do not specify the researchers who are guilty of the errors they identify, and so I am not certain whose work they intend to dismiss.) They suggest that "in many experiments reported in the literature, the two effects [pre-boundary lengthening and stress-timed shortening] are in fact confounded" (p. 153). And: "a large part of the difficulty with earlier experiments stems from a mistaken assumption that questions about the phonetic characterization of stress and its relation to phrasing and intonation can be addressed pretheoretically. The unexamined phonological theory that is only implicit in the experimental design then confounds the different possible interpretations" (pp. 153-4). Beckman and Edwards do not confound pre-boundary lengthening and stresstimed shortening, but, as I indicated earlier, they do not manipulate stress-timed shortening either. Accordingly, their work should be comparable to other research on pre-boundary lengthening in which effects of stress-timed shortening were held constant - the work of Cooper and Paccia-Cooper (1980), for example. The findings of these researchers are of interest because they are interpreted as support for a theory that the boundaries at which lengthening occurs are immediately syntactic. I do not think that the second criticism that Beckman and Edwards level against earlier work is relevant to the work of Cooper and Paccia-Cooper either, because from Cooper and Paccia-Cooper's perspective, they were examining syntactic influences on lengthening and pausing, not phonological influences. Beckman and Edwards may be reluctant to tackle Cooper and Paccia-Cooper's collection of findings, because, most likely, being unaware of the possible importance of 204
metrical structure to lengthening effects, Cooper and Paccia-Cooper confounded manipulations of metrical and syntactic structure. But, then, Beckman and Edwards allowed the syntax of their sentences to vary as they manipulated metrical phrasing, and so there may be confounding in their work, too. It would be useful to have syntactic and metrical variables unconfounded. But in the mean time, it would also be useful to know whether and how, in Beckman and Edwards's view, Cooper and Paccia-Cooper's theory might accommodate their own findings and whether and how Beckman and Edwards might handle effects that Cooper and Paccia-Cooper ascribe to unmediated effects of syntax on duration.

11.3 Compression limits
A final, minor methodological comment about the research largely concerns the findings of experiment 3. It is that segments have compression limits - limits, that is, on how much they can shorten (e.g. Klatt 1976). Effects of distinct sources of shortening that may add at longer durations of the segment no longer do so in the vicinity of the compression limit. /u/, the target stressed vowel of experiment 3, is short compared to the /a/ used in experiments 1 and 2. Possibly, where it is unaccented, it is already close to its compression limits for most speakers, especially at the fast and normal rates. Perhaps the different durational patterns shown by the same talkers on the different sentences of experiments 2 and 3 can, in part, be ascribed to this factor selectively at work in experiment 3.
Finally, a question A question this research raises in my mind (as does other research; see my comments on Patricia Keating's paper) is how we are to know which of the systematicities we observe in the articulatory behaviors of talkers or in their acoustic products should count as systematicities to which any component of a grammar - here a metrical phonology, for example - should address itself. The answer, evidently, is not "all of them." For example, as talkers speak, their lungs deflate in a systematic way, but we are not inclined to write a grammar in such a way that this systematicity is predictable. Nor should any of the acoustic correlates of this event be considered grammatical. That is, if the amplitude of the acoustic signal for an utterance declines with the falling subglottal pressure from the deflating lungs (as data from Carole Gelfer's 1987 thesis suggest), the grammar should not generate that systematicity either. Nor, to the extent that declination in fundamental frequency also follows the falling subglottal pressure (Gelfer, Harris, and Baer 1987) should it (that part of it so explained anyway) be treated as part of the phonology or predictable from the phonology. How can we know or how can we decide whether pre-boundary lengthening effects should be predictable from a description of the grammatical or phonological 205
structure of a language, from the syntactic structure or whether it shouldn't be predictable from the grammar at all ? Pre-boundary lengthening provides one hint that it is not a wholly phonologically-determined property of an utterance. In particular, that segments lengthen, rather than shorten, at boundaries may not be accidental. The lengthening may reflect the braking that inertial systems show generally as they stop gently. Is it possible that the tiered lengthenings reflect tiered stoppings as speech is produced ? If so, then the edges at which lengthenings occur signal the constituency of "output units" for the speech system (cf. Sternberg, Wright, Knoll and Monsell 1980; Monsell 1986). Should we expect these units to be phonological constituents ?
Notes Preparation of the manuscript was supported by NICHD Grant HD01994 and NINCDS Grant NS13617 to Haskins Laboratories and by a Fellowship from the John Simon Guggenheim Foundation. 1 At least one other account has been proposed, promoted in part by the observation that, as an attempt at isochrony of inter-stress intervals, stress-timed shortening is markedly half-hearted (e.g. Fowler 1977, 1981). I have speculated that it may not really reflect a duration controlling phenomenon at all, but rather may be tied to coarticulatory patterning. 2 In the traditional literature on stress timing, a stress foot is identified as a stressed syllable followed by as many unstressed syllables as occur until the next stress, even if word boundaries intervene between the stressed syllable and its following unstressed syllables (e.g. Abercrombie 1964). A basis for this definition is the occurrence of the shortening effect under consideration here (e.g. Lehiste 1973), which may be reduced, but is not eliminated by a word boundary (e.g. Fowler 1977; Huggins 1975, 1978). In contrast to this, in versions of metrical phonology in which stress feet are constituents (e.g. Selkirk 1980), word boundaries do delimit stress feet. However, unless I have missed it, this constraint is undefended. 3 We chose a syntactic boundary not because we intended to ally ourselves with the view that lengthening is conditioned by the syntax directly, but only because, for whatever metrical or syntactic reasons, these boundaries are associated with lengthening (Klatt 1975; Cooper and Paccia-Cooper 1980). 4 A different account for the "shortening" effects in these experiments might be that, instead, they reflect lengthening of the target syllable in a monosyllabic stress foot due to "stress clash" with the following stressed syllable (Liberman and Prince 1977). We discount this explanation for two reasons. First lengthening at the NP-VP boundary should have helped to alleviate the clash, but we find equivalent shortening across that boundary as across a word-boundary within a syntactic phrase. Second, evidence to date has failed to uncover evidence that talkers respond in other unexpected ways to putative stress clashes. That is, they fail to retract stress on words such as "thirteen" and " Pennsylvania " from the second stressed syllable to the first in the context of following stress-initial words (Cooper and Eady 1986).
References

Abercrombie, D. 1964. Syllable quantity and enclitics in English. In D. Abercrombie, D. B. Fry, P. A. D. MacCarthy, N. C. Scott and J. L. M. Trim (eds.) In Honour of Daniel Jones. London: Longman, 216-222.
1967. Elements of General Phonetics. Chicago: Aldine.
Cooper, W. and S. Eady. 1986. Metrical phonology in speech production. Journal of Memory and Language 25: 369-384.
Cooper, W. and J. Paccia-Cooper. 1980. Syntax and Speech. Cambridge, MA: Harvard University Press.
Fowler, C. A. 1977. Timing Control in Speech Production. Bloomington: Indiana University Linguistics Club.
1981. A relationship between coarticulation and compensatory shortening. Phonetica 38: 35-50.
Gee, J. P. and F. Grosjean. 1983. Performance structures: a psycholinguistic and linguistic appraisal. Cognitive Psychology 15: 411-458.
Gelfer, C. 1987. A simultaneous physiological and acoustic study of fundamental frequency declination. Ph.D. dissertation, CUNY.
Gelfer, C., K. Harris, and T. Baer. 1987. Controlled variables in sentence intonation. In T. Baer, C. Sasaki and K. Harris (eds.) Vocal Fold Physiology: Laryngeal Function in Phonation and Respiration. Boston: College-Hill Press, 422-435.
Huggins, A. W. F. 1975. On isochrony and syntax. In G. Fant and M. Tatham (eds.) Auditory Analysis and Perception of Speech. New York: Academic Press.
Huggins, A. W. F. 1978. Speech timing and intelligibility. In J. Requin (ed.) Attention and Performance 7. Hillsdale, NJ: Lawrence Erlbaum Associates, 279-297.
Klatt, D. 1975. Vowel lengthening is syntactically determined in connected discourse. Journal of Phonetics 3: 129-140.
1976. Linguistic uses of segment duration in English: acoustic and perceptual evidence. Journal of the Acoustical Society of America 59: 1208-1221.
Lehiste, I. 1973. Rhythmic units and syntactic units in production and perception. Journal of the Acoustical Society of America 54: 1228-1234.
Liberman, M. and A. Prince. 1977. On stress and linguistic rhythm. Linguistic Inquiry 8: 249-336.
Monsell, S. 1986. Programming of complex sequences: evidence from the timing of rapid speech and other productions. In H. Heuer and C. Fromm (eds.) Generation and Modulation of Action Patterns. Experimental Brain Research Series 15. New York: Springer-Verlag.
Oller, D. K. 1973. The effect of position in utterance on speech segment duration in English. Journal of the Acoustical Society of America 54: 1235-1247.
Pike, K. 1947. Phonemics. Ann Arbor: University of Michigan Press. Eleventh printing, 1968.
Rakerd, B., W. Sennett and C. Fowler. 1987. Domain-final lengthening and foot-level shortening. Phonetica 44: 147-155.
Selkirk, E. 1980. The role of prosodic domains in English word stress. Linguistic Inquiry 11: 563-605.
Sternberg, S., C. Wright, R. Knoll and S. Monsell. 1980. Motor programs in rapid speech: additional evidence. In R. A. Cole (ed.) The Perception and Production of Fluent Speech. Hillsdale, NJ: Lawrence Erlbaum Associates.
12 From performance to phonology: comments on Beckman and Edwards's paper
ANNE CUTLER
Beckman and Edwards have presented experimental evidence for the association of syllabic lengthening in speech with two phonological effects. One is the presence of an intonational phrase boundary. Lengthening associated with intonational phrase boundaries occurs consistently across speakers and speech rates. The other effect is the presence of a word boundary. In contrast to phrase-final lengthening, word-final lengthening occurs inconsistently. It is more evident at a slow rate of speech; even then, not all speakers show it with all sentences. Beckman and Edwards's aim in examining the occurrence of lengthening effects is to make claims about the proper model of English phonological structure. They relate their experiments to a long tradition of phonological interpretation of timing effects in speech, with two separate (and as Beckman and Edwards point out, at least in part incompatible) strands: lengthening which is claimed to accompany the terminal boundary of a phonological unit, and shortening which is claimed to adjust the interval between stressed syllables in the direction of greater regularity. The effects which Beckman and Edwards found in their experiments are of the first sort. This commentary concerns the assumptions on which undertakings such as Beckman and Edwards's are based. Precisely how does a speaker's performance shed light on the phonology of a language ? Two useful distinctions may be drawn. The first is between what one may call a strong and a weak version of the claim to "psychological reality." In the strong version, the claim that a particular level of linguistic analysis X or postulated process Y is psychologically real implies that the ultimately correct model of human language processing will include a level of representation corresponding to X or a mental operation corresponding to Y. In the weak version, "psychological reality" implies only that language users can draw on knowledge of their language which is accurately captured by the linguistic generalization in question. (Note that for certain linguistic constructs this latter claim is equivalent to no more than the constructs descriptive adequacy; for example, the intuitions which language 208
users are predicted to entertain by the weak reading of " psychological reality of the phoneme " are the same distributional data which led to the postulation of the construct phoneme in the first place.) The second useful distinction is between the processes of language production and language perception; or rather, between the relevance of phonological structure to production and to perception respectively. Crudely speaking, one might think of phonological structure exercising a particular effect on language production because the structure is relevant in some way (strongly or weakly) to the process of language production. Alternatively, one might imagine the same effect appearing in production simply because speakers want to make sure that the structure is correctly interpreted by hearers. In this case the presence of the effect in question can be said to be constrained by the nature of the process of perception rather than the process of production, in the sense that the speaker's behavior is governed by a model of the hearer. Some examples should make this second distinction clearer.
Production as a motivation Cooper and Paccia-Cooper (1980) carried out one of the most extensive investigations of temporal (and other phonological) effects associated with syntactic structure. Their stated aim was "to make use of temporal phenomena [to] arrive at the form of a speaker's grammatical code" (p.9). Among other effects, they found much evidence of segmental lengthening at syntactic boundaries; their preferred explanation of this phenomenon is that "lengthening effects are produced in order to allow the speaker an extra fraction of time for planning upcoming material in the next phrase. This planning may involve not only semantic and syntactic computations but also the resetting of ... articulatory postures" (p. 199). Since this explanation cannot account for why lengthening also occurs in utterance-final position, Cooper and Paccia-Cooper offer as a subsidiary explanation the possibility that "The lengthening effect could be attributable to execution ... [and] represents a general relaxation response ... The internal clock, in effect, runs more slowly at the ends of major constituents, presumably due to processing fatigue" (p. 199). Cooper and Paccia-Cooper point to some independent supporting evidence for each of these claims, and discuss and dismiss alternative explanations. Each of these mechanisms whereby segmental lengthening effects might arise is purely internal to the process of speech production. The syntactic structure around which Cooper and Paccia-Cooper built their investigations is, according to their explanations, the framework within which speech is produced. Speakers speak clause by clause and phrase by phrase; when they need extra time for planning the next unit of their utterance, or when they reach the end of a planning unit and are able to relax, the unit in question is a syntactic one. Although Cooper and Paccia-Cooper were concerned in their investigations 209
exclusively with syntactic structure, and couched their production-based account in syntactic terms, there is of course no reason why the argument could not equally well be applied, mutatis mutandis, to phonological structure. Speakers speak phonological phrase by phonological phrase, require planning time prior to the next phonological phrase, relax at the end of a phonological phrase.
Perception as a motivation Lehiste (1977) made a powerful argument for the perceptual origins of lengthening effects in speech. She reviewed evidence showing that listeners judge speech to be more rhythmically regular than it actually is, and, in effect, "expect isochrony" (p. 253). Speakers can trade on this expectation by using manipulation of speech rhythm to communicate information to listeners. For instance, they can signal the presence of a syntactic boundary by lengthening the interval between two stress beats on either side of the boundary. Scott (1982) provided confirming evidence that listeners' use of lengthening as a cue to the presence of a phrase boundary is not, in Beckman and Edwards's words, a direct decoding of syntax but is mediated by phonological structure in the form of inter-stress interval ratios. It is suggested, then, that lengthening associated with syntactic structure is produced by the speaker not for any production-internal reason such as the need for planning time or the relaxation of articulatory gestures, but for the benefit of the listener's perception of structure within the utterance. Note that invoking perceptual motivation for lengthening effects in speech production is not a particularly radical claim. Production involves a range of phenomena in which the choices by speakers are constrained by the needs of listeners (see Cutler 1987 for a review of some of these). The construction of a neologism, the way in which a slip of the tongue is corrected (and whether it is corrected at all), the choice of whether or not to obscure a word boundary by allowing the application of elision or assimilation — all of these production processes are constrained ultimately by the requirements of the perceiver. It is surely no more radical to claim that speakers should attempt to make phonological structure clear to their listeners.
Comparing production and perception arguments Naturally, independent evidence can be invoked to motivate either a productionbased or a perception-based account of speech performance phenomena. This was the approach taken by Lehiste (1977), who first reviewed the considerable independent evidence for listeners' over-estimation of rhythmic regularity before constructing her argument about the perceptual motivation for lengthening in production. But it should be noted that there is also independent evidence for the importance of rhythmic regularity in speech production, beyond the very existence
of lengthening/shortening effects in normal utterances. For instance, it is a startling fact that slips of the tongue which alter the number of syllables in an utterance (such as omission or addition of syllables) significantly more often result in an utterance more rhythmic than the target utterance would otherwise have been (Cutler 1980). This suggests that there may be a production advantage to regularity of speech rhythm. Here, something which the speaker did not want to say can be viewed as evidence of a speech production effect which is independent of the process of constructing an utterance, to the extent that pressure towards rhythmicity can on occasion interfere with the process of construction. However, one can also invoke evidence which is internal to the phenomenon at issue in order to distinguish between the relative merits of production-based and perception-based accounts. For instance, it may seem obvious that performance effects which have their origin in some characteristic of the production process should necessarily show up in all utterances where they are applicable. Performance effects which result from the speaker's intention to produce perceptual effects in the listener, on the other hand, might well be subject to the speaker's option — for instance, their presence might be argued to be contingent upon the speaker's recognition of the listener's needs. One of the most relevant experiments here is a study by Lehiste (1973), which dealt with syntactically ambiguous sentences. At issue was whether speakers would produce disambiguating cues (temporal cues, among others) which would enable listeners to distinguish between the various readings of an ambiguous utterance. Crucially, Lehiste told her set of speakers that all the sentences were ambiguous only after they had spoken the sentences. Then she (a) recorded which reading the speakers had intended in this first production, and (b) had the speakers produce each sentence again in each of its readings. All productions were then played to listeners who were asked to judge which reading had been intended. The listeners were often unable to tell which reading had been intended when the speakers were unaware of ambiguity. When the speakers were aware of the ambiguity and were attempting to signal one reading rather than another, however, their signals were generally distinguishable. Lehiste's physical measurements of the various productions confirmed that the consciously disambiguated utterances contained marked temporal and other cues to syntactic structure which were absent in the first, unaware, productions. Lehiste's study suggests that disambiguating prosody (of which in her results a very large component was temporal manipulation) is indeed to a considerable extent under the control of speaker awareness, which in turn suggests that it is not produced by factors internal to the production process. Strictly speaking, the mere absence of a particular effect on a particular occasion is not by itself evidence against a production-based account of that effect. Floor effects could for instance result in a temporal effect becoming so small as to be insignificant at fast rates of speech. Alternatively, an independent component of
the speech production process could interact in an unknown manner with the component producing the temporal effect in such a way that the effect came and went. However, if no such competing process can be independently motivated, and if rate of speech is explicitly controlled, then consistency versus inconsistency of a particular temporal effect is a useful guideline to its genesis. In terms of the strong versus weak distinction of psychological reality, a perception-based account implies only weak psychological reality, in the sense that speakers could not produce phonological distinctions for listeners unless they had a mental representation of the phonological structure available to them. A production-based account generally implies strong psychological reality; for example, Cooper and Paccia-Cooper's account is based on the assumption that speech is planned in chunks which are syntactically defined, so that the syntactic structure is present as a level of representation in the process of speech production. It is conceivable, however, that a production-based account could entail only weak psychological reality - for instance, if it only required reference to a mental representation of phonological structure itself independent of the production process, rather than requiring that phonological structure be a necessary component of one or more stages of processing.
Assumptions underlying Beckman and Edwards's arguments We come now to the present argument from performance to phonology. Beckman and Edwards are interested in the proper model of English prosodic phonology, and the way in which temporal effects in speech performance might shed light on this. Precisely what assumptions underlie their conception of the relationship between the phonology and the performance ? Beckman and Edwards do not explicitly state the reasons why they consider that performance sheds light on phonology. However, there are certain clues to their assumptions in their text. Let us concentrate on the case of word-final lengthening, which is, after all, both the major new contribution of Beckman and Edwards' paper and the primary puzzle which is still left at its end. On the one hand, Beckman and Edwards speak of word-final lengthening as being a "boundary mark" (p. 164), which "marks accentual phrases" (p. 170). All of these wordings suggest that Beckman and Edwards consider word-final lengthening, at least, to be under the control of speaker option, and to be used for marking structural boundaries for the sake of the listener. On the other hand, Beckman and Edwards also speak of a constituent "triggering" word-final lengthening (p. 170), which suggests a strong production-based motivation. In fact, the results of the various experiments support different conclusions about word-final lengthening. In experiments 1 and 2 the word-final effect is robust and reliable, which would at least be consistent with a production-based explanation. In experiment 3, however, only two subjects show simple word-final 212
lengthening. This is consistent with an explanation in which presence of the effect depends upon subjects' awareness of the structure with which the effect is associated. As Beckman and Edwards are aware, however, this state of affairs is highly unsatisfactory. Words per se are not necessarily phonological units. Moreover, it is beyond question that literate speakers are aware of words as units of their utterances. Thus there is a distinct possibility that the word-final lengthening effect is simply a direct analogue encoding of a non-phonological effect. This kind of explanation is what Beckman and Edwards are confessedly biased against, and it is understandable that they expend considerable effort on resisting it. Furthermore, since natural speech tends to contain a majority of monosyllabic words (Cutler and Carter, 1987), such an explanation would suggest that perhaps the majority of temporal variation above the segmental level relates only to lexical and not at all to phonological units. Because the primary problem for an interpretation of the word-final effect is the discrepancy between the results of experiments 1 and 2, on the one hand, and experiment 3, on the other, Beckman and Edwards attempt to reach a solution by looking for a crucial difference between the experiments. They themselves point out that their chosen solution, a possible difference in focus structure between the sets of materials in the earlier and later experiments, is both highly speculative and unable to account fully for all aspects of their results. However, there are further characteristics of the particular experimental materials Beckman and Edwards used which may also prove to be relevant. Consider first of all the question of subjects' awareness of structure. The Lehiste (1973) experiment showed that bringing a contrast to speakers' attention heightens the likelihood that the contrast will be marked in performance. The sentences used in all of Beckman and Edwards's experiments left speakers in no doubt of the contrasts under investigation. One cannot therefore rule out the possibility that the temporal distinctions they produced arose from their desire to emphasise the perceived contrast. The opposition was of course clearest in the pairs of sentences used in experiments 1 and 2, in which the same phonetic sequence received different lexical/syntactic parses. Consider further the phonetic structure of these sentences. Speakers were required to articulate three / p / phonemes, each followed by a different vowel, in the crucial region to be measured. It is not particularly far-fetched to suggest that the difficulty of articulating such a sequence with clarity might have been a contributory factor in rendering speakers more prone to produce lengthening effects. Indeed, research on tongue twisters suggests that lengthening (that is, slowing of articulation) is a characteristic response of speakers faced with an articulatorily awkward output (Kupin 1976). Significantly, in experiment 3, in which the crucial regions of the sentences contained no sequences of repeated phonemes, far less word-final lengthening was observed. 213
Suppose that instead of "Pop" and "Poppa" preceding the "opposed""posed" contrast, Beckman and Edwards had used, for example, "Dad" and "Dadda" (or perhaps, to avoid all repetitions, "Beck" and "Becca"). Would speakers have produced the word-final effects quite so consistently? These speculations are clearly testable. Firstly, one could use materials involving less phonetic repetition. Independently, one could investigate the effect in a situation in which subjects were as far as possible unaware of the contrast to be tested. If the word-final lengthening effect were to disappear as a result of these manipulations, one need look no more for a phonological explanation of it. If it should persist, then it is appropriate to continue the attempt to sort out whether or not it is a true phonological effect, and, orthogonally, whether it reflects characteristics of the process of speech production. My personal bet is that the word-final lengthening effect is both nonphonological and at least partly artifactual. But I hope I am wrong. The most interesting kind of claim about speech performance effects is the strong implementation claim, since it has implications both for linguistic and psychological theory. There is precious little precedent for claims about the role of phonological structure in speech production; I wish Beckman and Edwards had been able to advance unassailable evidence in support of such a claim.
References

Beckman, M. E. and J. Edwards. (this volume) Lengthenings and shortenings and the nature of prosodic constituency.
Cooper, W. E. and J. Paccia-Cooper. 1980. Syntax and Speech. Cambridge, MA: Harvard University Press.
Cutler, A. 1980. Syllable omission errors and isochrony. In H. W. Dechert and M. Raupach (eds.) Temporal Variables in Speech. The Hague: Mouton, 183-190.
1987. Speaking for listening. In A. Allport, D. G. MacKay, W. Prinz and E. Scheerer (eds.) Language Perception and Production: Relationships between Listening, Speaking, Reading and Writing. London: Academic Press, 23-40.
Cutler, A. and D. M. Carter. 1987. The predominance of strong initial syllables in the English vocabulary. Computer Speech and Language 2: 133-142.
Kupin, J. 1976. Tongue twisters. Paper presented to the Linguistic Society of America, Philadelphia, December 28-30.
Lehiste, I. 1973. Phonetic disambiguation of syntactic ambiguity. Glossa 7: 107-122.
1977. Isochrony reconsidered. Journal of Phonetics 5: 253-263.
Scott, D. R. 1982. Duration as a cue to the perception of a phrase boundary. Journal of the Acoustical Society of America 71: 996-1007.
13 The Delta programming language: an integrated approach to nonlinear phonology, phonetics, and speech synthesis
SUSAN R. HERTZ
13.1
Brief overview
The Delta programming language is designed for formalizing and testing phonological and phonetic theories. Its central data structure lets linguists represent utterances as multiple "streams" of synchronized units of their choice, giving them considerable flexibility in expressing the relationship between phonological and phonetic units. This paper presents Version 2 of the Delta language, showing how it can be applied to two linguistic models, one for Bambara tone and fundamental frequency patterns and one for English formant patterns. While Delta is a powerful, special-purpose language that alone should serve the needs of most phonologists, phoneticians, and linguistics students who wish to test their rules, the Delta System also provides the flexibility of a general-purpose language by letting users intermingle C programming language statements with Delta statements.
13.2
Introduction
Despite their common interest in studying the sounds of human language, the fields of phonology and phonetics have developed largely independently in recent years. One of the contributing factors to this unfortunate division has been the lack of linguistic rule development systems. Such systems are needed to let linguists easily express utterance representations and rules, and facilitate the computational implementation and testing of phonological and phonetic models. SRS (Hertz 1982) is a rule development system that was designed, starting in 1974, for just this purpose - to let linguists easily test phonological and phonetic rules, and explore the interface between phonology and phonetics through speech synthesis. SRS, however, was influenced quite heavily by the theory of generative phonology that was prevalent at the time, a theory that posited linear utterance representations consisting of a sequence of phoneme-sized segments represented as bundles of features (Chomsky and Halle 1968). Although at the phonetic level, SRS uses different "streams" for different synthesizer parameters, the parameter 215
values and segment durations must all be set in relation to the phoneme-sized segments at the linear phonological level. Thus, while SRS lets users express rules in a well-known linguistic rule notation, and easily change the rules, it forces them to work within a particular framework. Because SRS was biased toward a particular theory of sound systems, we became equally biased in our approach to data analysis and rule formulation. For example, we took for granted that phonemes (more precisely, phoneme-sized units) were the appropriate units for the assignment of durations and formant patterns, a basic assumption that blinded us for years to the possibility of alternative models. As alternatives finally emerged, however, the need for a more flexible system for expressing and testing phonological and phonetic rules became apparent. The clearest requirements for a more flexible rule development tool were a multi-level (or multi-tiered) data structure that could make explicit the relationship between phonological and phonetic units, and a precise andflexiblerule formalism for manipulating this structure. In response to these needs, in July 1983 I began the development of a new synthesis system, the Delta System (Hertz, Kadin, and Karplus 1985; Hertz 1986), in consultation with two computer scientists, Jim Kadin and Kevin Karplus. The Delta System provides a high-level programming language specifically designed to manipulate multi-level utterance representations of the sorts suggested by our rule-writing experience with SRS (Hertz 1980, 1981, 1982; Hertz and Beckman 1983; Beckman, Hertz, and Fujimura 1983). This language lets users write and test rules that operate on multi-level utterance representations without having to take care of the programming details that would be required in an ordinary programming language like C. A one-line Delta statement might easily take a page to accomplish in C. The ease of expressing and reading rules in Delta enables rule-writers to test alternative strategies freely and conveniently. While the move from the linear utterance representations central to SRS to the multilevel representations central to Delta parallels the move by phonologists from linear to nonlinear representations, the Delta System is a direct consequence of our SRS experience, and, unlike SRS, was developed independently of the phonological theories in vogue at the time. The Delta System is flexible enough to let phonologists and phoneticians of different persuasions express and test their ideas, constraining their representations and rules in the ways they, rather than the system, see fit. The system assumes as little as possible about the phonological and phonetic relationships that rule-writers may wish to represent in their utterance representations, and the manner in which their rules should apply, allowing them to make dependencies between rules explicit and giving them full control, for example, over whether their rules should apply cyclically or noncyclically, sequentially or simultaneously, left-to-right or right-to-left, morph by morph or syllable by syllable, to the entire utterance or only a portion thereof, and so on. 216
In addition to a sophisticated programming language for building and manipulating multilevel utterance representations, the Delta System provides a flexible window-based interactive development environment. In this environment, users can view in one window the Delta program that is executing, and issue commands in another window to Delta's source-level debugger. The debugger lets users trace their rules during program execution, stop the execution of their program at selected points (e.g. each time the utterance representation or a particular variable changes), display the utterance representation and other data structures, modify the utterance representation "on the fly" (e.g. in a program designed for synthesis, to hear the result of a longer duration for a particular unit), and so on. The interactive environment, like the system in general, is designed for speed andflexibilityin the development of phonological and phonetic rules, letting rule-writers test and modify their hypotheses quickly and easily. The interactive environment is an essential part of the system, but a description of it is outside the scope of this paper. Hertz et al. (1985) describes an early version of the debugger. The debugger has been enhanced substantially since that paper was written. It is now a complete source-level debugger with many more capabilities than those shown in the paper. The Delta System also includes a powerful macro processor with which users can tailor the syntax of Delta language statements and debugger commands to their liking. With macros, users can extend the language and debugger by defining new statements and commands, and avoid repeated typing of long or complex sequences. The Delta System has been designed to be as portable as possible, to give linguists the widest possible access to it. The system compiles Delta programs into C programs, and is designed to run on any computer with a standard C compiler and at least 512K of memory, such as an IBM PC-AT or a Macintosh.1 Compiling into C has the additional advantage that it lets users integrate C programs with Delta programs at will, even intermingling Delta and C code in a single procedure.2 The system is also made accessible through the comprehensive Delta User's Manual, which contains an overview of the system, tutorials, extensive reference sections, and sample programs. The next section of this paper (section 3) presents selected features of Version 2 of the Delta language, introducing many of the concepts needed to understand the sample programs in subsequent sections. It uses examples from Bambara, a Mande tone language spoken in Mali. These examples anticipate the programs in section 4, which illustrate how tone patterns and corresponding fundamental frequency values might be assigned in Bambara. Bambara is chosen because it exhibits many of the properties of tone languages that have motivated multilevel representations in phonology (e.g. tone spreading and floating tones), while at the same time providing good examples of the function of phonological units in determining actual phonetic values (e.g. the role of the tones in determining 217
fundamental frequency values). Section 5 presents a model of English formant timing that further illustrates Delta's flexibility in accommodating a wide range of theories about the interface between phonology and phonetics. Finally, section 6 presents conclusions, gives a brief overview of the features of Delta not described in the paper, and discusses our plans for enhancing Delta and complementing it with another system in the future. The examples and program fragments in this paper reflect the syntax of Version 2 of the Delta language at the time of writing. In several cases, this syntax differs from (and supersedes) that shown in earlier papers that describe the Delta System, reflecting improvements made as a result of experience with Delta Version 1.
13.3
Selected features of Delta
The Delta programming language is a high-level language designed to create, test, and manipulate a data structure called a delta for representing utterances. A delta consists of a set of user-defined streams of tokens that are synchronized with each other at strategic points. The tokens can represent anything the rule-writer wishes - phrases, morphs, syllables, tones, phonemes, subphonemic units, acoustic parameters, articulators, durations, classes of features, and so on. This section describes the structure of deltas, focusing first on the kinds of relationships that can exist between tokens in different streams, and then on the language for determining and testing these relationships. For most of its sample deltas, this section uses the Bambara phrase muso jaabi "answering the woman," which can be transcribed as [mùsó ! já:bí]. In this transcription, the grave accent [`] represents a low tone and the acute accent [´] a high tone. Thus the first syllable has Low tone, and all other syllables have High tone. The exclamation point represents tonal downstep, the lowering in pitch of the following high tones. This lowering occurs in Bambara after a definite noun. To account for the tonal downstep, linguists, following Bird (1966), posit as a definite marker a "floating Low tone" that occurs after the noun and is not associated with any syllable. Following is a sample delta that a program couched in the framework of autosegmental CV phonology (Clements and Keyser 1983) might build for the phrase muso jaabi: (1)
phrase:   |             NP              |             VP              |
word:     |         noun          |     |            verb             |
morph:    |         root          |     |            root             |
phoneme:  | m   | u   | s   | o   |     | j   | a         | b   | i   |
CV:       | C   | V   | C   | V   |     | C   | V   | V   | C   | V   |
nucleus:  |     | nuc |     | nuc |     |     |    nuc    |     | nuc |
syllable: |    syl    |    syl    |     |       syl       |    syl    |
tone:     |     L     |     H     |  L  |              H              |
          1     2     3     4     5     6     7     8     9     10    11
This delta consists of eight streams: phrase, word, morph, phoneme, CV, nucleus, syllable, and tone. The phrase stream has two tokens, NP (noun phrase) and VP (verb phrase); the word stream has two tokens, noun and verb; and so on. The tokens in the CV stream represent abstract timing units, in accordance with CV theory. The long phoneme a is synchronized with two V tokens in the CV stream, while the short vowels are synchronized with a single V. The nucleus stream marks each vowel, regardless of length, as the nucleus of the syllable. The tone stream has four tokens, two L (low) tokens and two H (high) tokens, reflecting the tone pattern given in the transcription above. The vertical bars in each stream are called sync marks. Our sample delta has eleven sync marks, numbered at the bottom for ease of reference. Sync marks are used to synchronize tokens across streams. For example, sync marks 1 and 3 synchronize all of the tokens that constitute the first syllable. Sync marks 5 and 6 surround a L tone that is not synchronized with any tokens in any other stream. It represents a floating Low tone that marks the noun phrase as definite, as discussed above. Sync marks 1 and 6 synchronize this floating tone along with the preceding L and H tones of the root with the NP token in the phrase stream. Following the analysis of Rialland and Sangare (1989), sync marks 6 and 11 synchronize a single H tone with two syllables and one root, rather than a separate H tone with each syllable. This delta could be expressed in more familiar autosegmental terms as follows: (2)
[Example (2), an autosegmental diagram of the same structure: the phonemes of muso jaabi are linked by association lines to C and V slots, nuclei, syllables, and tones, and bracketing marks the floating L as part of the NP headed by the noun root.]
Note that while the delta representation makes the appropriate associations among tokens using sync marks alone, the autosegmental representation must use brackets in addition to association lines in order to show that the floating Low tone is part of the noun phrase. Furthermore, while the tiers in autosegmental representations are critically ordered with respect to each other in the sense that "tokens" on one tier can only be explicitly linked to tokens on particular other tiers (see, for example, Clements 1985), the streams of a delta can always occur in any order with respect to each other. The same relationships exist between the tokens in different streams
regardless of the order in which the streams are listed. For example, the phoneme, syllable, and CV streams in delta (1) above could also be displayed in the following order: (3)
phoneme:  | m   | u   | s   | o   |     | j   | a         | b   | i   |
syllable: |    syl    |    syl    |     |       syl       |    syl    |
CV:       | C   | V   | C   | V   |     | C   | V   | V   | C   | V   |
As in this example, deltas from here on will show only those streams relevant to the discussion at hand. In addition to syntactic, morphological, and phonological streams, a delta can also have phonetic streams. For example, a time stream might be added to delta (1) as follows: (4)
phoneme:  | s   | o   |     | j   | a         |
CV:       | C   | V   |     | C   | V   | V   |
duration: | 200 | 150 |     | 140 | 200       |
Here, a duration (in milliseconds) is synchronized with each phoneme token. In a delta, unlike in an autosegmental representation, a time stream can be used to time articulatory movements or acoustic patterns with respect to phonological units like phonemes. In the following delta fragment, a time stream is used to time F0 targets with respect to the syllable nuclei: (5)
phoneme:  | s   |        o        |     | j   |        a        |
nucleus:  |     |       nuc       |     |     |       nuc       |
tone: ... |           H           |  L  |           H           |
F0:       |     |     | 150 |     |     |     |     | 130 |     |
duration: | 200 | 75  | 0   | 75  |     | 140 | 100 | 0   | 100 |
An F0 target is placed halfway through each nucleus. The targets themselves have no duration, being used only to shape the F0 pattern, which moves from target to target in accordance with the specified durations. A program designed for actual synthesis could interpolate the values (transitions) between the targets and send them, along with the values for other synthesizer parameters, to the synthesizer.
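The interpolation step is not spelled out here; the following short C sketch is purely illustrative (C is the language the Delta System compiles into, but this fragment is not part of the system, and the target times and values are hypothetical). It computes an F0 value for any point in time by linear interpolation between targets.

/* Illustrative only: linear interpolation between F0 targets. */
#include <stdio.h>

typedef struct { double time_ms; double f0_hz; } Target;

/* Return the F0 value at time t, interpolating linearly between the two
   targets that surround t; before the first target or after the last one,
   the nearest target value is used. */
static double f0_at(const Target *targets, int n, double t)
{
    if (t <= targets[0].time_ms) return targets[0].f0_hz;
    for (int i = 1; i < n; i++) {
        if (t <= targets[i].time_ms) {
            double span = targets[i].time_ms - targets[i - 1].time_ms;
            double frac = (t - targets[i - 1].time_ms) / span;
            return targets[i - 1].f0_hz +
                   frac * (targets[i].f0_hz - targets[i - 1].f0_hz);
        }
    }
    return targets[n - 1].f0_hz;
}

int main(void)
{
    /* Hypothetical targets: 150 Hz midway through one nucleus, 130 Hz
       midway through a later one. */
    Target targets[] = { { 275.0, 150.0 }, { 590.0, 130.0 } };
    for (double t = 0.0; t <= 700.0; t += 100.0)
        printf("%6.1f ms  %6.1f Hz\n", t, f0_at(targets, 2, t));
    return 0;
}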
13.3.1 Token structure
Each token in a stream is a collection of user-defined fields and values. Each token has at least a name field. It is the value of the name field that is displayed for each
token in the deltas shown above. Tokens can be given other fields as well. For example, a phoneme token named m might have the following fields and values: (6)
name: place: manner: class: nasality:
m labial sonorant cons nasal
where c o n s = "consonantal." All tokens in a given stream have the same fields, but of course the values for the fields can differ.3 Also, the value for a field can be undefined, when the value is not relevant for the token. Fields can be of different types, as discussed below. In most of the sample deltas, only the value of the name field is displayed, but it should be kept in mind that the tokens may have other fields as well. A field together with a particular value is called an attribute. Thus the phoneme token illustrated above has the attributes
, < p l a c e : l a b i a l > , < m a n n e r : s o n o r a n t >,and < n a s a l i t y : n a s a l > . The non-name attributes of a token are also called features, so that we can speak of the token named mas having the features < p l a c e : l a b i a l > , < m a n n e r : s o n o r a n t > , etc. In general, token features are distinguished from token names in this paper by being enclosed in angle brackets. When the value of a field is unambiguous (i.e. when it is a possible value for only one field in the stream in question), the field name can be omitted in Delta programs, so we can also consider the m to have the features < l a b i a l > , < s o n o r a n t > , and so on. This abbreviated form for features will be used throughout the paper.
13.3.2
Delta definitions
The first thing that a Delta rule-writer must do is give a delta definition. A delta definition consists of a set of stream definitions that define the streams to be built and manipulated by the program (rules). Figure 13.1 shows fragments of possible phoneme and ¥0 stream definitions for Bambara. All text following a double colon (::) to the end of a line is a comment that is not part of the actual stream definition. The phoneme stream definition defines the tokens in the phoneme stream as having name, p l a c e , manner, c l a s s , n a s a l i t y , voicing, h e i g h t , and b a c k n e s s fields. (The stream names in the program fragments in this paper are all preceded by a percent sign.) The name field is a name-valued field; the p l a c e , manner, h e i g h t , and b a c k n e s s fields are multi-valued fields; and the c l a s s , n a s a l i t y and v o i c i n g fields are binaryfields.A name-valued field is a field that contains the token names of some stream as
SUSAN R. HERTZ :: Phoneme stream definition: stream %phoneme: :: Fields and values: name:
place: manner: class: nasality: voicing: height: backness:
m, n, ng, b, d, j, g, p, t, c, k, . .., i, in, I, e, en, E, a, an, . . . , u, un; labial, alveolar, palatal, velar, ...; sonorant, obstruent, ...; cons; nasal; ~voiced; high, mid, low; front, central, back;
:: Initial features: m n
has voiced, labial, sonorant, nasal, cons; like m except alveolar;
j g
has voiced, palatal, stop, cons; like j except velar;
u has voiced, back, high; un like u except nasal; end; ::FO stream definition: stream %F0: name: integer;
::defines all integers as possible names
end; Figure 13.1 Sample Delta definition.
possible values.4 A multivalued field is a field that has more than two possible values and is not name-valued and not numeric (see the next paragraph). A binary field is a non-name-valued field that has exactly two possible values, such as < n a s a l > or < ~ n a s a l > where " ~ " is Delta notation for "not." A binary field is always defined by specifying only one of the two possible values. The opposite value is assumed. The phoneme tokens in this example do not have any numeric fields. A numeric field would be defined by following the field name with a keyword specifying the kind of number (integer or floating-point) that can be a value of the field, as shown in the FO stream definition, where the name field is defined as having integers as possible values. Thus a field can be both name-valued and numeric; all other field types are mutually exclusive.
Below the field definitions for the phoneme stream are a set of initial feature definitions. These definitions assign initial field values to tokens with particular names. When a token is inserted by name into a delta stream, the token's initial values are automatically set, as discussed below. If a token is not given a value for some field by an initial feature definition, it is automatically given a default value. For binary fields, the default value is the value not specified for the field in the stream definition. For example, in the case of our sample phoneme stream definition, all tokens not given the value < c ons > for the c l a s s field will automatically be given the value< ~ c o n s > , and all tokens not given the value < ~ v o i c e d > f o r the v o i c i n g field will automatically be given the value < v o i c e d > . For multi-valued, name-valued, and numeric fields, the default value is the special built-in value < u n d e f i n e d > , unless the user specifies otherwise in the stream definition.
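The defaulting rules just described can be restated in a few lines of C; the sketch below is for exposition only and is not Delta code: a binary field defaults to the opposite of the value named in the stream definition, while multi-valued, name-valued, and numeric fields default to <undefined>.

/* Illustrative only: default values for fields of different types. */
#include <stdio.h>

enum FieldType { BINARY, MULTI_VALUED, NAME_VALUED, NUMERIC };

/* For a binary field declared with, say, "nasal", the default is "~nasal";
   for one declared with "~voiced", the default is "voiced".  All other
   field types default to "undefined". */
static const char *default_value(enum FieldType type,
                                 const char *declared_value,
                                 char *buf, size_t bufsize)
{
    if (type == BINARY) {
        if (declared_value[0] == '~')
            return declared_value + 1;
        snprintf(buf, bufsize, "~%s", declared_value);
        return buf;
    }
    return "undefined";
}

int main(void)
{
    char buf[32];
    printf("nasality default: %s\n", default_value(BINARY, "nasal", buf, sizeof buf));
    printf("voicing default:  %s\n", default_value(BINARY, "~voiced", buf, sizeof buf));
    printf("place default:    %s\n", default_value(MULTI_VALUED, "", buf, sizeof buf));
    return 0;
}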
13.3.3
Sample program - synchronizing tokens
A Delta program consists of a delta definition followed by a set of procedures that operate on the delta. Figure 13.2 shows a short sample program that reads a sequence of phoneme tokens representing a Bambara word from the terminal into the phoneme stream, and synchronizes a C token in the CV stream with an initial consonantal token in the phoneme stream. (Later it will be shown how to apply the rule across the entire delta, synchronizing a C token with each consonantal phoneme.) This program consists of a single procedure called main. Every program must have at least a procedure called main, where execution of the program begins.

:: Delta definition:

      (delta definition goes here)

:: Main body of program:

proc main():

   :: Read phonemes from terminal into phoneme stream:
   read %phoneme;

   :: Synchronize a C token with an initial consonant:
   [%phoneme _^left <cons> !^ac] -> insert [%CV C] ^left...^ac;

   :: Print the resulting delta:
   print delta;

end main;

Figure 13.2 Sample Delta program.
When the program begins execution, the delta has the following form (assuming that the streams shown are those defined by the delta definition) :5 (7)
phrase: word: morph: phoneme: CV: syllable:
1 1 1 1 1 1
1 1 1 1 1 1
tone:
I
I
The first line in procedure main, (8)
read %phoneme;
reads a sequence of phoneme token names from the terminal and places the tokens in the phoneme stream. The fields of each phoneme are set as specified in the phoneme stream definition. For example, given the phoneme stream definition in figure 13.1, if the sequence m u s o is entered, the delta would have the following form after the read statement has been executed: (9)
phrase: word: morph: phoneme: name: place: manner: class: nasality: voicing: height: backness:
m labial sonorant cons nasal voiced
u
~cons ~nasal voiced high back
CV: syllable: tone:
The dashes for the h e i g h t and b a c k n e s s fields in the m and the p l a c e and manner fields in the u represent the value { u n d e f i n e d ) . The next statement, (10)
[^phoneme _ A left (cons) ! A ac] - > insert [%CV C] ^left. . . Aac;
is a rule. A rule consists of a test and an action. The action is separated from the test by a right arrow (— >). 224
The Delta programming language
The test portion of the rule, (11)
[%phoneme _"left
! A ac]
is a delta test which tests the delta for a particular pattern of tokens and sync marks. Sync marks are referred to in Delta programs by means of pointer variables (also called pointers), such as " l e f t and A a c in our sample rule. (Pointer variable names in this paper always begin with a caret ["]. The variable " l e f t is a builtin pointer that always points at the leftmost sync mark in the delta, while A ac ("after consonant") is a user-defined pointer whose use is explained below. The expression _ " l e f t is the anchor of the delta test, which specifies the sync mark where testing starts - in this case, the leftmost sync mark in the delta. The test looks in the phoneme stream immediately to the right of " l e f t for a token that has the feature < c o n s > . If such a token is found, the expression ! A ac sets A a c to point at the sync mark immediately to its right. Since our sample phoneme stream does start with a consonantal token, the delta test succeeds and A a c is set to the sync mark following the token: (12)
phoneme: CV:
| m | u | s | o | | | '"left A ac
The action of the rule, (13)
i n s e r t [%CV C] " l e f t . . . A ac;
inserts a token named C into the CV stream between " l e f t and A a c : (14)
phoneme: CV:
| m I C A left
| u | s | A ac
| o | |
The expression " l e f t . . . A a c , which specifies where the insertion is to take place, is called the insertion range. Note that the sync mark pointed to by A a c , originally defined only in the phoneme stream, is now defined in the CV stream as well. In general, when an insertion is made between two sync marks in the delta, each sync mark is put into the insertion stream if it does not already exist in that stream. A sync mark that is put (defined) in a new stream is said to be projected into that stream. The final statement in our sample program, (15)
print delta; 225
SUSAN R. HERTZ
displays the delta, showing only the token names in each stream. (Other print statements can be used to display features.) The sample program ends after this statement. 13.3.4 Adjacent sync marks Adjacent sync marks in a stream act like a single sync mark for purposes of testing the stream. Consider, for example, the morph and t o n e streams of delta (1): (16)
morph tone:
| root | | root | |L |H |L | H |
The following delta test would succeed, despite the intervention of two adjacent sync marks between the roots in the morph stream: (17)
[%morph _ ^ l e f t
root
root]
The special built-in token GAP could be used as a "filler" token to prevent the two sync marks from being regarded as adjacent. If a GAP token were placed between the two r o o t tokens in the morph stream shown in example (16), the above test would fail. 13.3.5 One-point vs. two-point insertions The insert statement presented in figure 13.2 specified a two-point insertion, which places a token sequence between two sync marks already existing in the delta. An insert statement can also specify a one-point insertion, which places a token sequence to the right or left of a single sync mark. For example, given the delta (18)
phoneme: syllable: tone:
|m |u | syl | 1 2
|s |o | syl L 3 4
| | | L 5
| | ... | 6
the one-point insert statement (19)
insert
[%tone H] . . . A 5;
would insert a H token into the t o n e stream just before sync mark 5, automatically creating a new sync mark before the inserted H token :6 (20)
phoneme: syllable:
|m |
tone:
1 1
|u
Is
syl
1
|o syl
| 1 1 1 L 1 H |L 1 3 t 4 5 2 6 (new sync mark) 226
...
The Delta programming language
The new sync mark is unordered with respect to sync marks 2, 3, and 4, as discussed in the next section. It will be assumed in the remainder of this paper that in the examples that contain deltas with numbered sync marks, like the example above, ^1 has been set to point at sync mark 1, ^2 at sync mark 2, etc. 13.3.6 Sync mark ordering Within each stream, all the sync marks have an obvious left to right ordering. Across streams, however, two sync marks may or may not have a relative left to right ordering. Consider, for example, the following delta: morph: phoneme: syllable: tone:
|
root
1
syl
L 1 1 2
|
root a | b 1i syl 1 syl
1 1
1 m | u |s 1 o
C_l.
(21)
1a I
syl 1 1 1 H 1 L |
3 4 5
6
7
H 8
9
10
|
1 1 1
11 12
In this delta, sync mark 4 is to the left of sync marks 6 through 12, because sync mark 6 is in the t o n e stream after sync mark 4, and is also in the phoneme stream before sync marks 7 through 12. However, sync mark 4 is not ordered with respect to sync mark 5, because sync mark 5 does not exist in the t o n e stream, and there is no sync mark between sync mark 4 and 5 that is to the right of 4 or the left of 5 or vice versa. (Sync mark 4 could just as well have been displayed after sync mark 5.) By the same logic, sync mark 4 is also not ordered with respect to sync mark 2 or 3. Delta (21) might be posited as an early form for the word muso during the derivation of its surface representation, as explained below. It gives the tone pattern L H to the first root, without synchronizing either tone with a particular syllable. Delta's merge statement can be used to merge two unordered sync marks into a single sync mark, creating the appropriate synchronizations. For example, if the statement (22)
merge ^3 ^4;
were applied to the delta (21), it would produce the following delta: (23)
morph:    |          root          |   |
phoneme:  |  m  |  u  |  s  |  o   |   |
syllable: |    syl    |    syl     |   |
tone:     |    L      |     H      | L |
          1     2    3 (4)   5     6
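The cross-stream ordering relation can be made concrete outside of Delta. The following Python fragment is an illustrative sketch only (it is not part of the Delta system): each stream is written as a left-to-right list of sync mark identifiers, and one sync mark is taken to precede another if some chain of within-stream orderings connects them. The stream layout in the example is my reading of the muso portion of delta (21).

from collections import defaultdict, deque

def ordered_before(streams, a, b):
    """True if sync mark a can be shown to lie to the left of sync mark b."""
    succ = defaultdict(set)
    for marks in streams.values():           # each stream: ordered list of ids
        for i, m in enumerate(marks):
            succ[m].update(marks[i + 1:])     # everything later in that stream
    seen, queue = {a}, deque([a])
    while queue:                              # breadth-first search over "follows"
        for n in succ[queue.popleft()]:
            if n == b:
                return True
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return False                              # no chain found: not ordered before b

streams = {                                   # hypothetical rendering of the
    "morph":    [1, 6],                       # muso part of delta (21)
    "phoneme":  [1, 2, 3, 5, 6],
    "syllable": [1, 3, 6],
    "tone":     [1, 4, 6],
}
print(ordered_before(streams, 4, 6))          # True: 4 precedes 6 in the tone stream
print(ordered_before(streams, 4, 5) or
      ordered_before(streams, 5, 4))          # False: 4 and 5 are unordered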
13.3.7 Contexts
The ordering of sync marks is important for determining sync mark contexts, which are in turn important for determining where sync marks can be legally projected and for determining the relationship between tokens across streams, as explained below. A sync mark's context in stream x is the region between the left context boundary and the right context boundary of the sync mark in stream x. The left context boundary of sync mark a in stream x is the closest sync mark in stream x that is equal to or to the left of sync mark a. Similarly, the right context boundary in stream x is the closest sync mark in stream x that is equal to or to the right of sync mark a. For example, in the delta (24)
morph:    |          root          |   |
phoneme:  |  m  |  u  |  s  |  o   |   |
syllable: |    syl    |    syl     |   |
tone:     |      L       |    H    | L |
          1     2     3  4  5      6
the left context boundary of sync mark 2 in the morph stream is sync mark 1, and the right context boundary of sync mark 2 in the morph stream is sync mark 6. Thus, the context of sync mark 2 in the morph stream is the root token (i.e. the entire region between sync marks 1 and 6). The context in the phoneme stream of sync mark 4 is the sequence of phonemes and sync marks between sync marks 1 and 6 (recall that sync mark 4 is unordered with respect to sync marks 2, 3, and 5). If sync mark a exists in stream x, it is its own context and its own left and right context boundaries. Note that every sync mark has a context in every stream. The terms left context and right context are used throughout this paper as abbreviations for left context boundary and right context boundary. The Delta language has the operators \\ and // for taking the left and right context of a sync mark. For example, the statement (25)
^x = \\%syllable ^5;
sets ^x to point at the sync mark that is the left context of ^5 in the syllable stream - at sync mark 3 in the case of delta (24). In the following test, the context operators are used to test whether a H tone is the only token between the left context in the tone stream of sync mark 5 and the right context in the tone stream of sync mark 6. This test can be applied to delta (24) to determine whether the vowel o (surrounded by sync marks 5 and 6) is "contained in" a H tone. (26)
[%tone _(\\%tone ^5) H (//%tone ^6)]
This test would fail in the case of delta (24), since the o is "contained in" a sequence of two tones, L H, but would succeed in delta (23), where the merging of sync marks 3 and 4 has provided an ordering of sync mark 5 with respect to the sync mark between the two tones. Delta lets users define default streams for different purposes. The remainder of this paper assumes that the tone stream has been defined to be the general default stream, which applies, among other things, to context operators. Thus, context operators not followed by a stream name in subsequent examples are assumed to apply to the tone stream.
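The definitions of left and right context boundaries can also be paraphrased in plain Python. The sketch below is illustrative only and is not Delta code; the stream layout is my rendering of delta (24), and the helper function simply recomputes the cross-stream ordering described in the previous section.

def precedes(streams, a, b):
    """True if some chain of within-stream orderings puts a left of b."""
    frontier, seen = {a}, {a}
    while frontier:
        nxt = set()
        for marks in streams.values():
            for i, m in enumerate(marks):
                if m in frontier:
                    nxt.update(marks[i + 1:])
        nxt -= seen
        if b in nxt:
            return True
        seen |= nxt
        frontier = nxt
    return False

def context_boundaries(streams, stream, a):
    """Left and right context boundaries of sync mark a in the given stream."""
    marks = streams[stream]
    if a in marks:                       # a mark is its own context boundary
        return a, a
    lefts  = [m for m in marks if precedes(streams, m, a)]
    rights = [m for m in marks if precedes(streams, a, m)]
    return (lefts[-1] if lefts else None, rights[0] if rights else None)

streams = {"morph": [1, 6], "phoneme": [1, 2, 3, 5, 6],
           "syllable": [1, 3, 6], "tone": [1, 4, 6]}
print(context_boundaries(streams, "morph", 2))      # (1, 6)
print(context_boundaries(streams, "syllable", 5))   # (3, 6), cf. example (25)
print(context_boundaries(streams, "tone", 5))       # (1, 6): why test (26) fails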
13.3.8 Time streams
While sync marks can be used to specify gross temporal relationships, such as whether one token is before, after, partway through, or concurrent with another, time streams are needed to make precise temporal specifications - for example, that a token in one stream begins 75% of the way through the duration of a token in another stream, or 50 ms. after its end. The tokens in a time stream are divisible, in the sense that sync marks can be projected into the middle of them.7 When a sync mark is projected into a time token, the token is automatically divided into the appropriate pieces. For example, if a sync mark is placed a quarter of the way through a token with a duration (name) of 100, the token would get divided into a token with duration 25 and a second token with duration 75. A time stream is defined very much like any other stream, as illustrated by the following definition of a time stream called duration: (27)
time stream ^duration:
    name: integer
end ^duration;
The keyword time in the first line indicates that the stream is a time stream. Any number of time streams can be defined and used in a single delta. For example, a rule-writer might use one time stream for slow speech and another for fast speech, or one time stream for actual milliseconds and another for abstract phonological "time." Consider, for example, a hypothetical language that is like Bambara in all respects except that intervocalic consonants are ambisyllabic. In the following delta for muso in this language, an abstract time stream called abs_time is used to represent the s as ambisyllabic: (28)
syllable: |    syl     |      syl       |
phoneme:  |  m |   u   |    s    |  o   |
abs_time: |  1 |   1   | .5 | .5 |  1   |
duration: | 70 |  120  |   200   | 150  |
Note that given such a representation of ambisyllabicity, the context operators can be used to determine the phonemic composition of the syllables. The phonemes that comprise a syllable are those phonemes that are between the left context in the phoneme stream of the sync mark before the syllable and the right context in the phoneme stream of the sync mark following the syllable. In the Delta language, a time expression is used to refer to a particular point in a time stream. In its simplest form, a time expression consists of a time stream name, a pointer name, a plus or minus sign, and an expression representing a quantity of time. For example, given the delta (29)

phoneme:  |  s  |  o  |     |  a  | ...
CV:       |  C  |  V  |     |  V  |
nucleus:  |     | nuc |     | nuc |
tone:     |     |  H  |     |  H  |
duration: | 200 | 150 | 140 | 200 |
          1     2     3     4     5
the following time expressions would all refer to the instant midway between ^2 and ^3:

(30)  %duration ^2 + 75
(31)  %duration ^3 - 75
(32)  %duration ^4 - 215
(33)  %duration ^5 - 415
(34)  %duration ^2 + (.5 * dur(^2...^3))
where dur(^2...^3) in the last time expression uses the built-in function dur, which returns the duration between the specified points in the delta, in this case ^2 and ^3. In general, the expression after the plus or minus sign in a time expression can be an arbitrarily complex numeric expression. Delta lets users define a particular time stream as the default time stream, so that the stream name can be omitted in time expressions that refer to that stream. Users can also specify in a time stream definition whether time is measured from left to right or right to left. When time is measured from left to right, a time expression with a positive stream offset, such as expression (30) above, refers to a point that is n time units to the right of the sync mark specified by the pointer variable, while one with a negative time offset refers to a point n time units to the left. When time is measured from right to left (as it might be, for example, for a Semitic language, which is written from right to left), a positive time offset specifies a time to the left, and a negative one a time to the right. It will be assumed in the remainder of this paper that each delta has a single time stream called duration, that this stream is the default time stream, and that time is measured from left to right. Time expressions can be used anywhere ordinary pointer variables can be used
- for example, in the range of an insert statement. Thus, given delta (29) above, the statements (35)
n = (.4 * dur(^2...^3));
[%tone _\\^2 H //^3] ->
    insert [%F0 160] (^2 + n)...(^3 - n);
would produce the following result (36)
phoneme:  |  s  |        o        | ...
CV:       |  C  |        V        |
tone:     | ... |        H        |
F0:       |     |     | 160 |     |
duration: | 200 | 60  |  30 | 60  |
Notice that sync marks have automatically been placed at the appropriate timepoints in the delta. In general, when a time expression is used where a sync mark is required, the sync mark is automatically created in the time stream if it does not already exist. Delta has the special operator at for placing a token at a single point in time. Thus if instead of the rule in example (35), the rule (37)
[%tone _\\^2 H //^3] ->
    insert [%F0 160] at (^2 + (.5 * dur(^2...^3)));
had been executed, the result would be the following: (38)
phoneme:  |  s  |         o         | ...
CV:       |  C  |         V         |
tone:     | ... |         H         |
F0:       |     |      | 160 |      |
duration: | 200 |  75  |  0  |  75  |
          ^1    ^2                  ^3
The at operator is special in that it will create two sync marks if no sync marks exist at the specified point in time, it will add one sync mark if a single sync mark exists, and it will use the outermost sync marks if several exist. Thus the at operator makes it easy to place tokens in different streams at the same point in time, and synchronize the tokens with each other.
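The arithmetic behind dur and the division of time tokens is simple enough to mimic in a few lines of ordinary Python. The fragment below is an informal sketch, not Delta code: a duration stream is modelled as a list of token durations, with sync marks at the positions between them (so sync mark ^1 is position 0, ^2 is position 1, and so on); the durations 200, 150, 140, 200 are taken from my reading of delta (29).

def dur(durs, i, j):
    """Duration between sync-mark positions i and j, like the built-in dur."""
    return sum(durs[i:j])

def mark_at(durs, t):
    """Project a sync mark at absolute time t, dividing the token it falls in.
    Returns the new duration list and the index of the (possibly new) mark."""
    elapsed = 0
    for i, d in enumerate(durs):
        if t == elapsed:
            return durs, i                              # a mark already exists here
        if elapsed < t < elapsed + d:
            cut = t - elapsed
            return durs[:i] + [cut, d - cut] + durs[i + 1:], i + 1
        elapsed += d
    return durs, len(durs)

# The 100 ms token of the text, split a quarter of the way through:
print(mark_at([100], 25))                 # ([25, 75], 1)

# Something like rule (37): a point halfway between sync marks 2 and 3:
durs = [200, 150, 140, 200]
t = dur(durs, 0, 1) + 0.5 * dur(durs, 1, 2)             # 200 + 75 = 275 ms
print(mark_at(durs, t))                   # ([200, 75.0, 75.0, 140, 200], 2)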
13.3.9 Forall rules
The previous sections have included several examples of rules. Each of these rules was restricted to operating at one particular point in the delta. For example, rule (10), (10)
[%phoneme _^left <cons> !^ac] ->
    insert [%CV C] ^left...^ac;
is only tested against the first token in the phoneme stream. Usually, however, a rule is meant to apply at all appropriate points across the entire delta. Delta has several ways of applying rules across the entire delta or selected stretches of the delta. One of the most useful of these is the forall rule, which performs the action of the rule each time the rule's test succeeds. Consider, for example, the following forall rule, which is like rule (10) except that it is applied across the entire delta: (39)

forall [%phoneme _^bc <cons> !^ac] ->
    insert [%CV C] ^bc...^ac;
The first time the rule is executed, ^bc ("before consonant"), the anchor of the test, is automatically set to ^left. If a consonantal phoneme follows ^bc, the action of the rule is performed, synchronizing a C token in the CV stream with the phoneme. After the action is performed, and also whenever the forall test fails to match, ^bc is automatically advanced to the next sync mark in the phoneme stream, and the test is repeated. This advancing is continued until ^bc hits the rightmost sync mark in the delta, in which case the rule terminates. Assume that rule (39) is being applied to the following delta: (40)
phoneme: | m | u | s | o |
CV:      |               |
         1   2   3   4   5
First, ^bc would be set at sync mark 1. The forall test would succeed, setting ^ac at sync mark 2, and the action would insert a C token between ^bc and ^ac: (41)
phoneme: | m | u | s | o |
CV:      | C |           |
         1   2   3   4   5
        ^bc ^ac
Then ^bc would be advanced to sync mark 2. The forall test would fail, since a vocalic, rather than consonantal, phoneme follows. ^bc would then be advanced to sync mark 3, the forall test would succeed (setting ^ac at sync mark 4), and the action of the rule would insert a C token: (42)
phoneme: | m | u | s | o |
CV:      | C |   | C |   |
         1   2   3   4   5
                ^bc ^ac
The rule would continue to execute in this fashion until ^bc reaches sync mark 5. Users can override the system's default assumptions about forall rule application by specifying the pointer to be used as the advance pointer (by default the anchor, as above), the initial setting of the advance pointer, where the advance pointer should start from on subsequent iterations of the rule, the stream through which to advance, in which direction to advance (left to right or right to left), and so on. These options can be specified rule by rule or more globally, to a sequence of rules. In general, the forall rule is a powerful construct with which users can specify precisely how their rules should apply. The forall rules in the remainder of this paper assume a global specification that sets the advance pointer to the sync mark preceding the next token (skipping adjacent sync marks) before each execution of the forall test.

13.3.10 Token variables

The program fragments in previous sections have included several examples of pointer variables and one example of a numeric variable (example 35). In addition, Delta provides token variables, which hold entire tokens. The following forall rule uses a token variable to replace with a single phoneme all token pairs consisting of two identical vocalic phonemes. For example, it replaces two adjacent a's with a single a: (43)
forall [%phoneme _^bv <~cons> !$vowel $vowel !^av] ->
    insert [%phoneme $vowel] ^bv...^av;
Assume that this rule is being applied to the following delta: (44)
phoneme: | j | a | a | b | i |
(The two a tokens might be used in the initial underlying representation for jaabi to represent the long vowel [a:], as discussed below.) The forall test first looks for a vocalic (<~cons>) phoneme following ^bv ("before vowel"). It will succeed when ^bv precedes the first a. The expression !$vowel puts a copy of the first a token in the token variable $vowel. (Token variable names in this paper always start with a dollar sign.) The next expression, $vowel, tests the token after a to see whether it is the same as the token in $vowel - that is, whether it is also a. If so, ^av is set after this vowel, and the action of the rule inserts a copy of the token in $vowel between ^bv and ^av, thereby replacing the two vowels and their intervening sync mark with a single a: (45)
phoneme: | j | a | b | i |
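The combined effect of the forall advance and the token variable can be imitated with a simple left-to-right scan in Python. The fragment below is only a rough analogue of rule (43), not Delta code; the vowel test stands in for the <~cons> feature specification, and the merging proceeds pairwise from the left.

VOWELS = set("aeiou")            # hypothetical stand-in for the <~cons> test

def merge_identical_vowels(phonemes):
    """Collapse two identical adjacent vocalic phonemes into one."""
    out, i = [], 0
    while i < len(phonemes):
        p = phonemes[i]
        # like !$vowel ... $vowel: remember the vowel, compare the next token
        if p in VOWELS and i + 1 < len(phonemes) and phonemes[i + 1] == p:
            out.append(p)        # the pair is replaced by a single token
            i += 2
        else:
            out.append(p)
            i += 1
    return out

print(merge_identical_vowels(list("jaabi")))   # ['j', 'a', 'b', 'i']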
13.3.11 Fences

It is often necessary to prevent a delta test from matching patterns that span particular linguistic boundaries, such as morph boundaries. The scope of a delta test can easily be restricted to particular linguistic units by using fences. Sync marks in one or more specified streams (e.g., a morph stream or a syllable stream) can be declared to be fences. A delta test fails if it would have to cross a fence in order to match the specified pattern. Fences are often used in conjunction with forall tests. For example, in the program fragment (46)
fence morph;
forall [%phoneme _^bv <~cons> !$vowel $vowel !^av] ->
    insert [%phoneme $vowel] ^bv...^av;
the fence statement specifies that all sync marks that are defined in the morph stream are fences, restricting the forall rule (which is the same as rule 43) to operating only on identical vowels that are within the same morph. A fence statement restricts the operation of all subsequent delta tests until its effects are reversed by another statement.

13.3.12 Conclusion

This section has presented the delta data structure, and has shown various ways in which this data structure can be tested and manipulated. The concepts and constructs presented - streams, tokens, sync marks, contexts, delta tests, one-point and two-point insertions, time streams, forall rules, fences, and others - are by no means a complete inventory of Delta language features; rather, they have been strategically selected for presentation in order to provide enough information to follow the sample programs in the next two main sections, which show applications of the Delta language to Bambara and English.
13.4 Modeling tone and fundamental frequency patterns in Bambara

This section presents some sample programs that illustrate how Delta can be used to formalize and test a model of Bambara tone realization. This model is actually an integration of two separate models, a model of phonological tone assignment based on the work of Rialland and Sangare (1989), and a model of the phonetic realization of the phonological tones based on the work of Mountford (1983). Only the facts concerning monosyllabic, bisyllabic, and trisyllabic noncompound words in Bambara are considered. Words with four syllables exist, but are rare.
13.4.1 A model of Bambara tone assignment
Each Bambara word has an inherent (unpredictable) tone pattern. Monosyllabic and bisyllabic words either have a Low or a High tone on all the syllables. For example, the word muso is inherently low-toned ([muso]), and the word jaabi is inherently high-toned ([ja:bi]). The reason that muso was shown throughout this paper with a high tone on the final syllable will become clear in the ensuing discussion. A trisyllabic morph can have one of several patterns: 1) It can be inherently high-toned or low-toned in the same way as the monosyllabic and bisyllabic words are. For example, the morph [galama] 'ladle' is inherently low-toned, while [sungurun] 'girl' is inherently high-toned. (In the transcriptions, word-final [n], or [n] before a consonant, marks the preceding vowel as nasalized.) 2) It can have the pattern Low-High-Low, as in [sakene] 'lizard.' 3) It can have the pattern Low-High or High-Low. In this case, it is optional whether the first tone is associated with the first syllable and the second tone with the last two syllables, or the first tone with the first two syllables and the last tone with the last syllable. For example, the High-Low morph mangoro 'mango' can be realized as [mangoro] or [mangoro]. Only the second realization will be considered for purposes of the program shown below. In addition to the inherent tones of morphs, Rialland and Sangare posit two other kinds of tones in underlying representations of Bambara utterances: 1) a floating Low tone that follows definite noun phrases, as illustrated in delta (1) above, and suggested by the work of Bird (1966), and 2) a floating High tone suffix that follows all content morphs (as opposed to function morphs). A floating Low tone definite marker is not associated with any particular syllable in surface representations; it serves to trigger tonal downstep at the phonetic level, the lowering in fundamental frequency of following High tones. The High tone suffix for content morphs is realized on different syllables, depending on whether the content morph is part of a definite noun phrase - that is, on whether a floating Low tone immediately follows the content morph. If there is no floating Low tone, the High tone is realized on the first syllable of the morph following the content morph, if there is one. Otherwise, it is realized on the last syllable of the content morph itself. For example, consider the phrase muso don 'It's a woman,' which has an indefinite noun phrase, and hence, no floating Low tone. In this phrase, the high tone associated with muso is added to the inherently low-toned don, producing the pattern [muso don], where the circumflex accent [ˆ] represents the tone pattern High-Low. (Here, the content tone has been added to the tone of the one-syllable morph. In a polysyllabic morph, the High tone simply replaces the inherent tone associated with the first syllable.) On the other hand, the phrase muso don 'It's the woman,' which has a definite noun phrase, is realized as [muso don]. See Rialland and Sangare (1989) and Bird, Hutchison, and Kante (1977) for a fuller description and analysis of Bambara tone patterns.
13.4.2 Formulation of the tone model in Delta
According to the model just presented, Bambara tone patterns consist of sequences of high and low tones. Each tone is either an inherent part of a morph, a floating definite marker, or a floating content marker, as illustrated in figure 13.3, which shows several underlying representations of Bambara utterances in delta form. In the transcription in example (5) in the figure, the wedge [ˇ] represents the tone pattern Low-High. Given underlying forms like these, the correct tone pattern can be assigned in two main steps:

1. Assign each floating H tone to the appropriate morph. Put it at the end of a preceding morph if a floating L tone follows (i.e., move the floating H tone to the left of the right context of the morph in the tone stream). Otherwise, attach it to the beginning of the following morph. (An informal sketch of this step appears just below.)

2. Merge unordered pairs of syllable and tone sync marks from right to left within each morph until there are no unordered pairs left, thereby creating the appropriate synchronizations between tones and syllables.
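As a rough cross-check of step 1, the following Python fragment (an illustrative sketch only, not the Delta rule of figure 13.6) represents an utterance as a flat list of morphs and floating tones and moves each floating H according to the rule above. The tuple-based representation is invented for the example.

def assign_floating_highs(items):
    """Step 1: a floating H goes to the end of the preceding morph if a
    floating L follows it, otherwise to the front of the following morph;
    the floating H marker itself is then removed."""
    out = [("morph", list(tones)) if kind == "morph" else (kind, tones)
           for kind, tones in items]          # copy morph tone lists
    for i, (kind, value) in enumerate(out):
        if kind != "float" or value != "H":
            continue
        if i + 1 < len(out) and out[i + 1] == ("float", "L"):
            target = next(m for k, m in reversed(out[:i]) if k == "morph")
            target.append("H")                # end of the preceding morph
        else:
            target = next(m for k, m in out[i + 1:] if k == "morph")
            target.insert(0, "H")             # front of the following morph
    return [item for item in out if item != ("float", "H")]

# muso don 'It's a woman' (indefinite, no floating L): don ends up H L.
print(assign_floating_highs(
    [("morph", ["L"]), ("float", "H"), ("morph", ["L"])]))
# muso don 'It's the woman' (definite, floating L): muso ends up L H.
print(assign_floating_highs(
    [("morph", ["L"]), ("float", "H"), ("float", "L"), ("morph", ["L"])]))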
Step 1 would create the forms shown in figure 13.4 for the sample deltas in figure 13.3. Step 2 would operate on the deltas in figure 13.4 to produce those shown in figure 13.5. Note that this step does not change delta 1, since this delta contains no sync marks in the syllable and tone streams that are unordered with respect to each other. In delta 5, the right to left merging of sync marks correctly leaves the first syllable synchronized with two tones, L and H. Left to right merging of the sync marks would produce the wrong result. A final step might be to combine adjacent H tones and L tones into single H and L tones, but it does not make any difference for purposes of the program below that generates fundamental frequency patterns on the basis of the tone patterns whether a single tone is associated with several syllables, or each of the syllables has its own tone. Step 1 can be accomplished in Delta by the forall rule shown in figure 13.6. The forall test (47)
([%tone _^bh H !^ah] & [%morph _^bh ^ah])
contains two independently anchored parts. The first matches any H token after ^bh, setting ^ah after it. The second tests whether ^bh and ^ah point at adjacent
1) 'It's a woman' (surface form: muso don):

   morph:    |     root      |   |  root   |
   phoneme:  | m | u | s | o |   | d | on  |
   syllable: |  syl  |  syl  |   |   syl   |
   tone:     |       L       | H |    L    |

2) 'It's the woman' (surface form: muso don):

   morph:    |     root      |   |   |  root   |
   phoneme:  | m | u | s | o |   |   | d | on  |
   syllable: |  syl  |  syl  |   |   |   syl   |
   tone:     |       L       | H | L |    L    |

3) 'answering the woman' (surface form: muso jabi):

   morph:    |     root      |   |   |     root      |   |
   phoneme:  | m | u | s | o |   |   | j | a | b | i |   |
   syllable: |  syl  |  syl  |   |   |  syl  |  syl  |   |
   tone:     |       L       | H | L |       H       | H |

4) 'It's a mango' (surface form: mangoro don):

   morph:    |         root          |   |  root   |
   phoneme:  | m | an | g | o | r | o |   | d | on  |
   syllable: |  syl   |  syl  |  syl  |   |   syl   |
   tone:     |   H    |       L       | H |    L    |

5) 'It's the lizard' (surface form: sakene don):

   morph:    |         root          |   |   |  root   |
   phoneme:  | s | a | k | e | n | e |   |   | d | on  |
   syllable: |  syl  |  syl  |  syl  |   |   |   syl   |
   tone:     |   L   |   H   |   L   | H | L |    L    |

Figure 13.3 Initial underlying deltas for Bambara utterances.
sync marks in the morph stream - i.e. whether the H token matched by the first part is a floating tone. The action of the forall rule is a do statement, which groups (between the keywords do and od) a sequence of Delta statements that function syntactically like a single statement. This particular do statement includes an if statement, which groups one or more rules into a disjunctive set. The last part of the if statement is an else statement, which is executed if none of the preceding tests within the if statement succeeds.
1) 'It's a woman' (surface form: muso don):

   morph:    |     root      |  root   |
   phoneme:  | m | u | s | o | d | on  |
   syllable: |  syl  |  syl  |   syl   |
   tone:     |       L       | H |  L  |

2) 'It's the woman' (surface form: muso don):

   morph:    |     root      |   |  root   |
   phoneme:  | m | u | s | o |   | d | on  |
   syllable: |  syl  |  syl  |   |   syl   |
   tone:     |   L   |   H   | L |    L    |

3) 'answering the woman' (surface form: muso jabi):

   morph:    |     root      |   |     root      |
   phoneme:  | m | u | s | o |   | j | a | b | i |
   syllable: |  syl  |  syl  |   |  syl  |  syl  |
   tone:     |   L   |   H   | L |   H   |   H   |

4) 'It's a mango' (surface form: mangoro don):

   morph:    |         root          |  root   |
   phoneme:  | m | an | g | o | r | o | d | on  |
   syllable: |  syl   |  syl  |  syl  |   syl   |
   tone:     |   H    |       L       | H |  L  |

5) 'It's the lizard' (surface form: sakene don):

   morph:    |         root          |   |  root   |
   phoneme:  | s | a | k | e | n | e |   | d | on  |
   syllable: |  syl  |  syl  |  syl  |   |   syl   |
   tone:     |  L  |  H  |  L  |  H  | L |    L    |

Figure 13.4 Underlying deltas after floating high tone assignment.
The second step, which merges the appropriate sync marks in the syllable stream with those in the tone stream, can be expressed as the forall rule shown in figure 13.7. The forall test (48)
[%morph _^bm <> !^am]
uses the empty angle-bracket notation (<>), which matches any token - in this
1) 'It's a woman':

   morph:    |     root      |  root   |
   phoneme:  | m | u | s | o | d | on  |
   syllable: |  syl  |  syl  |   syl   |
   tone:     |       L       | H |  L  |

2) 'It's the woman':

   morph:    |     root      |   |  root   |
   phoneme:  | m | u | s | o |   | d | on  |
   syllable: |  syl  |  syl  |   |   syl   |
   tone:     |   L   |   H   | L |    L    |

3) 'answering the woman':

   morph:    |     root      |   |     root      |
   phoneme:  | m | u | s | o |   | j | a | b | i |
   syllable: |  syl  |  syl  |   |  syl  |  syl  |
   tone:     |   L   |   H   | L |   H   |   H   |

4) 'It's a mango':

   morph:    |         root          |  root   |
   phoneme:  | m | an | g | o | r | o | d | on  |
   syllable: |  syl   |  syl  |  syl  |   syl   |
   tone:     |       H        |   L   | H |  L  |

5) 'It's the lizard':

   morph:    |         root          |   |  root   |
   phoneme:  | s | a | k | e | n | e |   | d | on  |
   syllable: |  syl  |  syl  |  syl  |   |   syl   |
   tone:     | L | H |   L   |   H   | L |    L    |

Figure 13.5 Surface forms produced by sync mark merging.
case, the token following ^bm. The action of the forall rule includes a repeat statement, the action of which is executed repeatedly until a statement in the action, in this case an exit statement, causes the loop to terminate. The loop moves pointers ^bt and ^bs right to left through the morph, merging the appropriate sync marks, and terminating whenever ^bt or ^bs equals ^bm - that is, when one of these pointers reaches the beginning of the morph.
:: Forall floating H tones (^bh = "before H", ^ah = "after H").
forall ( [%tone _^bh H !^ah] & [%morph _^bh ^ah] ) ->
    do
        if
            :: If the floating H occurs before a floating L, move the H tone
            :: into the end of the preceding morph.  Otherwise, insert the
            :: H tone at the beginning of the following morph.  Moving the
            :: H tone is accomplished by inserting a new H tone and deleting
            :: the floating one.
            ( [%tone _^ah L !^al] & [%morph _^ah ^al] ) ->
                insert [%tone H] ...^bh;
            else ->
                insert [%tone H] ^ah...;
        :: Delete original floating H & following sync mark:
        delete %tone ^bh...^ah;
        delete %tone ^ah;
    od;

Figure 13.6 Forall rule for floating High tone assignment.
13.4.3 Building underlying representations using dictionaries

The program fragments shown above assume an underlying representation in which each morph is associated with a tone pattern, and floating High and Low tones are present where appropriate. A Delta program designed to test the rules incorporated in the above program fragments would have to build the appropriate underlying representations in the first place. To prevent users from having to enter underlying representations for the utterances they wish to test (a tedious task), the program could be designed such that the user only has to specify phoneme names, morph boundaries, and the unpredictable Low floating tones. For example, our sample utterance muso jaabi might be entered as follows: (49)
+ m u s o v + j a a b i +
where + designates a morph boundary and [v] a floating Low tone. Two a tokens in succession represent the long vowel [a:], as discussed earlier. On the basis of this information, the program could easily build the following structure: (50)
morph:   |     root      |   |       root        |
phoneme: | m | u | s | o |   | j | a | a | b | i |
CV:      | C | V | C | V |   | C | V | V | C | V |
tone:    |               | L |                   |
The next step would be to insert the inherent morph tones and the floating high
:: For each morph (^bm = "begin morph", ^am = "after morph")...
forall [%morph _^bm <> !^am] ->
    do
        :: Set ^bs (begin syllable) and ^bt (begin tone)
        :: to ^am (after morph):
        ^bs = ^am;
        ^bt = ^am;
        repeat ->
            do
                :: Set ^bt before the next tone token to the left.  If there
                :: are no more tone tokens in the morph (i.e., ^bt has
                :: reached ^bm), exit the loop.
                [%tone !^bt <> _^bt];
                (^bt == ^bm) -> exit;
                :: Set ^bs before the next syllable token to the left.  If
                :: there are no more syllable tokens in the morph, exit the
                :: loop.
                [%syllable !^bs <> _^bs];
                (^bs == ^bm) -> exit;
                :: Merge the sync mark before the tone and the sync mark
                :: before the syllable:
                merge ^bt ^bs;
            od;
    od;

Figure 13.7 Forall rule for sync mark merging.
content tones. Since neither the inherent tones nor the content tones are predictable from this delta, a dictionary must be used. Delta has two kinds of dictionaries, action dictionaries and set dictionaries, or simply sets, both of which are
useful for creating underlying representations from forms like the above. An action dictionary contains token name strings (henceforth "search strings") and associated actions. An action for an entry, which can consist of any legal Delta statements, is automatically performed whenever a search string is matched - that is, an identical token name sequence is looked up in the dictionary. For our purposes, an action dictionary named morphs might be defined to contain all morphs (represented in terms of phoneme names) and associated tone patterns.8 More specifically, the action for each morph might be an insert statement that inserts the appropriate tones for the morph into the delta, as shown in figure 13.8.9 An action dictionary always has at least two pointer variables associated with it (in this case ^b and ^e), which take on the values of the pointers delimiting the token name sequence being looked up.
actdict %phoneme morphs (^b ^e):
    entries:
        :: morphs with low tones:
        a,                  :: 'he'
        b a l a,            :: 'porcupine'
        d on,               :: 'this is'
        g a l a m a,        :: 'ladle'
        m u s o,            :: 'woman'
            -> insert [%tone L] ^b...^e;

        :: morphs with high tones:
        s un g u r un,      :: 'girl'
        k u r un,           :: 'canoe'
        j a a b i,          :: 'answer'
        j i,                :: 'water'
            -> insert [%tone H] ^b...^e;

        :: morphs with high-low patterns:
        m an g o r o,       :: 'mango'
            -> insert [%tone H L] ^b...^e;

        :: morphs with low-high-low patterns:
        s a k e n e,        :: 'type of lizard'
            -> insert [%tone L H L] ^b...^e;
    ; :: end entries
end morphs;

Figure 13.8 Sample morph dictionary.
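The lookup logic of figure 13.8, together with the function-morph set and content-tone rule introduced below, can be roughed out in ordinary Python. The fragment below is a sketch only, not Delta: the dictionary keys are ordinary spellings standing in for the phoneme-name sequences of the actdict, and the output format is invented for the example.

MORPH_TONES = {
    "muso":    ["L"],            # 'woman'
    "galama":  ["L"],            # 'ladle'
    "don":     ["L"],            # 'this is'
    "jaabi":   ["H"],            # 'answer'
    "mangoro": ["H", "L"],       # 'mango'
    "sakene":  ["L", "H", "L"],  # 'type of lizard'
}

FUNCTION_MORPHS = {"a", "don"}   # like the set function_morphs in (54)

def underlying_tones(morphs):
    """Attach inherent tones to each morph and a floating H after each
    content morph, in the spirit of rules (51) and (55)."""
    out = []
    for m in morphs:
        out.append((m, MORPH_TONES.get(m, [])))
        if m not in FUNCTION_MORPHS:
            out.append(("floating", ["H"]))
    return out

print(underlying_tones(["muso", "jaabi"]))
# [('muso', ['L']), ('floating', ['H']), ('jaabi', ['H']), ('floating', ['H'])]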
The following forall rule operates on a delta of the form shown in delta (50) above, looking up in dictionary morphs the phoneme names associated with each morph in the delta: (51)
forall [%morph _^bm <> !^am] ->
    find %phoneme ^bm...^am in morphs;
If the phoneme sequence is found, the dictionary automatically synchronizes the appropriate tones with the morph, creating, for example, the following deltas for muso jaabi and mangoro don: (52)
morph:   |     root      |   |       root        |
phoneme: | m | u | s | o |   | j | a | a | b | i |
tone:    |       L       | L |         H         |

(53)

morph:   |         root          |  root   |
phoneme: | m | an | g | o | r | o | d | on  |
tone:    |     H     |     L     |    L    |
The next step would be to insert the floating High content tones after the content morphs. This step could be accomplished by adding the appropriate insert statement to the action of all content morphs in dictionary morphs. However, since there are so many content morphs, a simpler strategy would be to "mark" the non-content morphs as function morphs, and insert the content tone for all morphs that are not function morphs. The simplest way to mark the appropriate morphs is to place them in a set. A set contains search strings, but no actions. For example, one might define a set called function_morphs as follows: (54)
set %phoneme function_morphs: a, d on, ...;
(Since all of the morphs in set function_morphs are also in the action dictionary morphs, an alternative to this set definition would be to follow each function morph in the action dictionary with an expression that places the morph in set function_morphs.) Given set function_morphs, a rule could be added to forall rule (51) above to insert a H tone after any morph not found in the set: (55)
forall [%morph _^bm <> !^am] ->
    do
        ~find %phoneme ^bm...^am in function_morphs ->
            insert [%tone H] ^am... duplicate;
    od;
The option duplicate at the end of the insert statement specifies that the new sync mark after the inserted H tone should be projected into all streams in which ^am, the range pointer, is defined. This rule would operate on deltas (52) and (53) to produce the following underlying representations for muso jaabi and mangoro don: (56)
morph:   |     root      |   |   |       root        |   |
phoneme: | m | u | s | o |   |   | j | a | a | b | i |   |
CV:      | C | V | C | V |   |   | C | V | V | C | V |   |
tone:    |       L       | H | L |         H         | H |

(57)

morph:   |         root          |   |  root   |
phoneme: | m | an | g | o | r | o |   | d | on  |
CV:      | C | V  | C | V | C | V |   | C | V   |
tone:    |     H     |     L     | H |    L    |
A final step in creating the underlying representation would be to reduce sequences of identical vocalic phonemes that are not separated by a morph boundary, such as the a a of jaabi, into a single phoneme, as shown earlier in example (46).

13.4.4 A model of Bambara fundamental frequency patterns
On the basis of the CV and tone streams, the appropriate F0 pattern for the tones can be determined. This section considers one possible strategy of F0 determination, based on the work of Mountford (1983). According to Mountford, High and Low tones descend along relatively straight and independent lines, as shown schematically in the diagram of a sentence with the tone pattern H L L H L H L in figure 13.9. The starting and ending frequencies of the baseline are relatively independent of the sentence duration. Thus the slope of the baseline varies with sentence duration. The sample program below uses a starting frequency of 170 hertz and an ending frequency of 130 hertz for the baseline. The starting and ending frequencies of the topline are a function of sentence duration. For the sake of simplicity, this detail is ignored in the sample program, which assumes a starting frequency of 230 hertz and an ending frequency of 150 hertz.10 My own very preliminary laboratory studies of Bambara tone patterns suggest that the F0 tone targets are generally realized halfway through each syllable nucleus - that is, long vowels behave the same way as short vowels for the purpose of F0 target placement. Furthermore, the F0 targets tend to have no durations of their own, serving only to shape the overall F0 contour.
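The topline/baseline arithmetic can be checked with a few lines of Python before it is written as Delta rules. The sketch below is illustrative only: the line frequencies are the values quoted above, but the nucleus timings in the example are made up, not taken from delta (58).

H_START, H_END = 230.0, 150.0        # topline (High tones), in Hz
L_START, L_END = 170.0, 130.0        # baseline (Low tones), in Hz

def f0_targets(nuclei, sent_dur):
    """nuclei: list of (tone, nucleus_start_ms, nucleus_dur_ms).
    Returns one (time_ms, f0_hz) target per nucleus, placed halfway
    through the nucleus on the appropriate descending line."""
    targets = []
    for tone, start, dur in nuclei:
        t = start + 0.5 * dur
        if tone == "H":
            f0 = H_START + (H_END - H_START) * t / sent_dur
        else:
            f0 = L_START + (L_END - L_START) * t / sent_dur
        targets.append((t, round(f0)))
    return targets

# Hypothetical nucleus timings over a 1080 ms sentence:
nuclei = [("L", 70, 120), ("H", 390, 150), ("H", 735, 190), ("H", 1000, 60)]
print(f0_targets(nuclei, 1080))      # first target comes out near 165 Hz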
13.4.5 Formulation of the fundamental frequency model in Delta
The model of F0 assignment just outlined can easily be implemented in Delta, as shown by the program fragment in figure 13.10, which generates a descending topline and a descending baseline, and determines the appropriate F0 values along those ramps. The program assumes that each phoneme was given a duration by earlier statements. The forall rule in this program would generate the following values for our sample delta for muso jaabi (assuming the segment durations shown and a sentence duration of 1080 ms): (58)
morph:    |            root              |   |            root             |
phoneme:  | m  |     u     |  s  |    o     |   |  j  |     a     |  b  |  i  |
CV:       | C  |     V     |  C  |    V     |   |  C  |     V     |  C  |  V  |
nucleus:  |    |    nuc    |     |   nuc    |   |     |    nuc    |     | nuc |
syllable: |      syl       |     syl       |   |      syl        |    syl    |
tone:     |       L        |      H        | L |       H         |     H     |
F0:       |    | |165|     |     | |195|   |   |     | |172|     |     |     |
duration: | 70 |60| 0 |60  | 200 |75|0|75  |   | 140 |95| 0 |95  |     |     |
The program has been oversimplified for purposes of illustration. For example,
[Figure 13.9 shows a schematic F0 diagram: a descending topline for the High tones and a descending baseline for the Low tones, with the tone sequence H L L H L H L marked along the bottom.]

Figure 13.9 Baseline and topline.
it does not handle single syllables that have two tones, such as don in utterance 1 in figure 13.5 above, nor does it compute the starting and ending frequencies of the high tone line as a function of sentence duration. Furthermore, it does not handle the tone raising and lowering phenomena that occur in Bambara for tones in particular contexts. Rialland and Sangare (1989) posit a rule that downsteps (lowers) a High tone and all subsequent High tones after a floating Low tone. See also Mountford (1983) for some posited raising and lowering rules. Our sample program could easily be expanded to handle all of these phenomena. In addition to F0 values, a program designed for synthesis would have to compute or extract from a dictionary the values for other synthesizer parameters. Delta's generate statement can be used to create a file of parameter values and durations for the synthesizer on the basis of the parameter streams and an associated time stream: (59)
generate %F0, %F1, %F2, %F3, ... time ^duration;
Also, users can tailor their program to their particular synthesizer by writing their own C programs to generate values for the synthesizer, and their own synthesizer driver. (The ability to interface C programs at will lets Delta be used in conjunction with all kinds of programs, including programs for demisyllable and diphone synthesis.)
H_start_freq = 230;    :: Starting frequency of High tone line
H_end_freq   = 150;    :: Ending frequency of High tone line
L_start_freq = 170;    :: Starting frequency of Low tone line
L_end_freq   = 130;    :: Ending frequency of Low tone line

:: Sentence duration:
sent_dur = dur(^left...^right);

:: Slope of High tone line:
H_slope = (H_end_freq - H_start_freq) / sent_dur;

:: Slope of Low tone line:
L_slope = (L_end_freq - L_start_freq) / sent_dur;

:: Insert an F0 value for each nucleus:
forall [%nucleus _^bn nuc !^an] ->
    do
        :: Compute the duration from the beginning of the sentence to the
        :: point halfway through the V (the point where we wish to compute
        :: and insert the F0 value):
        half_nuc_dur = .5 * dur(^bn...^an);
        elapsed_time = dur(^left...^bn) + half_nuc_dur;
        :: Compute the F0 value depending on whether the nucleus is
        :: low- or high-toned:
        if
            [%tone _\\^bn H //^an] ->
                f0_val = H_slope * elapsed_time + H_start_freq;
            else ->
                f0_val = L_slope * elapsed_time + L_start_freq;
        :: Insert the computed F0 value halfway through the nucleus:
        insert [%F0 f0_val] at (^bn + half_nuc_dur);
    od;

Figure 13.10 Delta program for F0 target assignment.
13.5 Modeling English formant patterns

The previous section illustrated how Delta can be used to formalize and integrate a phonological model of tone assignment and a phonetic model of tone realization. This section demonstrates further Delta's flexibility by drawing from my own work on modeling English formant patterns to show how different hypotheses I have come up with over the years can be formulated in Delta. The section focuses on my most recent model, in which formant targets and intervening transitions are represented as independent durational units that are related to higher-level phonological constituents in well-defined ways. This model relies critically on the concept of synchronized units at both the phonological and phonetic levels.

13.5.1 Model 1: multilevel utterance representations, implicit transitions
The first work I did on modeling English formant patterns was in the context of my students' and my synthesis rule development. Our SRS synthesis rules for English, developed between 1978 and 1982, are based on the hypothesis that every phoneme (i.e. phoneme-sized unit) has an intrinsic duration that is modified according to such factors as segmental context and stress. Formant targets (usually two of them) are set in relation to the segment's duration - for example, 20% and 80% of the way through the segment.11 For the most part, all durational adjustments, such as stretching before voiced segments, are made between the formant targets within given segments, so that the durations of the formant transitions between targets in adjacent segments remain constant. In the course of implementing a set of SRS rules based on the model just outlined, we realized that our formant target and duration rules might be simplified if we treated certain sonorant sequences that act as a single syllable nucleus, such as [ei], [ai] and [ar], as two units for the purpose of assigning formant targets, and as single units for other purposes, such as assigning amplitude patterns. (A syllable nucleus in this model consists of a vowel + a tautosyllabic sonorant, if such a sonorant exists.) Such a structure was impossible to represent straightforwardly with SRS, which relies on linear utterance representations, but is quite simple to represent with Delta, as shown below for the word ice: (60)
syllable: |     syl     |
nucleus:  |   nuc   |   |
phoneme:  | a  |  i | s |
By representing the [i] of syllable nuclei like [ai] as an independent phoneme token, the rules can assign a single target value for each formant - say, an F2 value of 2000 hertz - to all i's, regardless of their context. Syllable nuclei can still be treated as single units where appropriate. While this model simplifies the prediction of formant values, however, it leads to complicated rules for positioning the formant values with respect to the edges of segments, since the formant target positions for a token in one context (e.g. i as the sole component of the nucleus) are not necessarily the same as those in another (e.g. i at the end of a diphthong).
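One simple way to mimic a multilevel representation like delta (60) outside of Delta is to give every token the pair of sync-mark positions it spans; tokens in different streams are then "synchronized" when they share positions. The Python fragment below is an illustrative sketch only, and it works only because everything in this small delta is totally ordered.

ice = {
    "syllable": [("syl", 0, 3)],
    "nucleus":  [("nuc", 0, 2)],
    "phoneme":  [("a", 0, 1), ("i", 1, 2), ("s", 2, 3)],
}

def tokens_within(delta, stream, start, end):
    """Tokens of a stream lying entirely between two sync-mark positions."""
    return [name for name, s, e in delta[stream] if start <= s and e <= end]

# The phonemes that make up the nucleus [ai]:
nuc_name, nuc_start, nuc_end = ice["nucleus"][0]
print(tokens_within(ice, "phoneme", nuc_start, nuc_end))   # ['a', 'i']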
13.5.2 Model 2: multilevel representations, explicit transitions
My recent durational studies support the view that formant transitions are relatively stable in duration. To avoid the complicated target placement rules alluded to above, however, I am exploring a new model, in which the transitions are represented as independent durational units, as shown below for the word ice: (61)
syllable: |             syl              |
nucleus:  |        nuc         |         |
phoneme:  |  a   |     |  i   |    |  s   |
F2:       | 1400 |     | 2000 |    | 1800 |
duration: |  45  |  90 |  30  | 40 |  90  |
Note that the formant transitions are represented by adjacent sync marks in the phoneme and F2 streams, with intervening durations in the duration stream. In this model, each formant target is synchronized with an entire phoneme token, and the duration-modification rules modify entire phoneme durations. In diphthongs, such as [ai], the duration rules modify only the first portion of the diphthong (e.g., the [a] of [ai]), thereby keeping the duration of the transition portion of the diphthong (e.g., from [a] to [i]) constant. Below are some forall rules that might be used to insert formant values and formant transitions, and to modify phoneme durations in the appropriate contexts, building a delta like (61) above. These examples assume that earlier statements have inserted the appropriate initial phoneme durations so that the delta for ice looks as follows before the rules apply: (62)
syllable: |      syl       |
nucleus:  |    nuc    |    |
phoneme:  |  a  |  i  | s  |
F2:       |                |
duration: | 75  | 30  | 90 |
The following forall rule matches each phoneme, synchronizing a value for each formant with the phoneme: (63)
forall [%phoneme _^1 <> !^2] ->
    if
        :: F2 values
        [%phoneme _^1 i] -> insert [%F2 2000] ^1...^2;
        [%phoneme _^1 a] -> insert [%F2 1400] ^1...^2;
        [%phoneme _^1 s] -> insert [%F2 1800] ^1...^2;

This rule operates on delta (62) to produce the following: (64)

syllable: |       syl        |
nucleus:  |     nuc    |     |
phoneme:  |  a   |  i  |  s  |
F2:       | 1400 | 2000| 1800|
duration: |  75  |  30 |  90 |
The next rule modifies the durations of vowels in certain contexts. In the case of delta (64), it shortens the duration of a to 60% of its previous duration, since it is in a nucleus that precedes a tautosyllabic voiceless segment. The statement fence %syllable preceding the forall rule prevents the delta tests in the rule from crossing a syllable boundary. (65)
fence %syllable;
forall [%phoneme _^bv <~cons> !^av] & [%nucleus _^bv ...] & [%phoneme ...] ->
    dur(^bv...^av) *= .6;

where the expression dur(^bv...^av) *= .6 is a shorthand for dur(^bv...^av) = .6 * dur(^bv...^av).
This rule shortens the a in delta (64) from 75 ms to 45. (66)
syllable: |       syl        |
nucleus:  |     nuc    |     |
phoneme:  |  a   |  i  |  s  |
F2:       | 1400 | 2000| 1800|
duration: |  45  |  30 |  90 |
A more complete program would modify the duration of consonants as well, depending, for example, on their position in the syllable and on whether the syllable is stressed. Finally, the following forall rule inserts transitions between adjacent phonemes: (67)
forall [%phoneme _^1 <> !^2] ->
    if
        [_^1 <sonorant> <obstruent>] -> insert [%duration 40] ^2... duplicate;
        [_^1 <sonorant> <sonorant>]  -> insert [%duration 90] ^2... duplicate;
This rule operates on delta (66) to produce the following: (68)
syllable: |             syl              |
nucleus:  |        nuc         |         |
phoneme:  |  a   |     |  i   |    |  s   |
F2:       | 1400 |     | 2000 |    | 1800 |
duration: |  45  |  90 |  30  | 40 |  90  |
The transition rules presented here are hypothetical; at the time of writing I have not yet investigated in any systematic way what factors determine the transition durations. The transition durations probably depend as much on the place of articulation of the two phonemes as on their manner, or possibly even directly on the formant values they connect, or on other factors yet to be discovered. Independent developments in autosegmental CV theory (e.g. Clements and Keyser 1983) have suggested some minor revisions to the model, as illustrated by the following representation for ice, in which the diphthong-final i is differentiated from the vowel i by virtue of being synchronized with a C, rather than with a V: (69)
syllable: |             syl              |
nucleus:  |        nuc         |         |
CV:       |  V   |     |  C   |    |  C   |
phoneme:  |  a   |     |  i   |    |  s   |
F2:       | 1400 |     | 2000 |    | 1800 |
duration: |  45  |  90 |  30  | 40 |  90  |
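The content of a representation like (69) can be checked by turning it into an F2 contour. The following Python fragment is an illustrative sketch only, not Delta code: an utterance is written as an alternating sequence of target plateaus and transitions, and the function returns the breakpoints of the resulting piecewise-linear contour (straight lines between breakpoints give the transitions).

def f2_breakpoints(units):
    """units: ("target", f2_hz, dur_ms) or ("transition", dur_ms) items.
    Returns (time_ms, f2_hz) breakpoints."""
    points, t = [], 0
    for unit in units:
        if unit[0] == "target":
            _, f2, dur = unit
            points.append((t, f2))       # target held over its own span
            t += dur
            points.append((t, f2))
        else:
            t += unit[1]                 # a transition only adds time; the
                                         # contour slopes to the next target
    return points

ice = [("target", 1400, 45),             # [a] portion of [ai], values from (69)
       ("transition", 90),               # [a]-to-[i] transition
       ("target", 2000, 30),             # [i]
       ("transition", 40),               # [i]-to-[s] transition
       ("target", 1800, 90)]             # [s]
print(f2_breakpoints(ice))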
The remainder of this discussion will assume a representation like the one in delta (69). Given a CV stream, one might hypothesize that every C and V is associated with a single value for each formant, reflecting the segment's place of articulation. However, this assumption does not work with the segment [h], which exhibits no targets of its own; rather, it acquires its formant structure on the basis of the preceding and following segments, its formants looking very much like aspirated transitions from the last formant target in the preceding segment to the first formant target in the following segment. Consider, for example, the schematic representation of the second formant pattern of the sequence [g h ai s], as in big heist, in figure 13.11. The traditional way to segment this utterance would be into the four chunks shown in the figure. A problem caused by this segmentation, however, becomes apparent when we consider the second formant pattern of the sequence [g ai s], the same sequence without the [h], shown in figure 13.12. The second formant patterns of the two sequences are for all intents and purposes identical except that the one with [h] has aspiration, rather than voicing, during the transition from the [g] to the target for the [a]. In order to write rules assigning a duration to the [a] in each of these examples, one would have to posit a rule that shortens the vowel after [h]. Such a rule would be very unusual in English, since modifications to English vowel durations are not generally triggered by preceding segments. Furthermore, in order to generate the F2 pattern in the shortened vowel, one would have to move the target by precisely the amount that the vowel has been shortened. Given the transition model posited above, we might hypothesize instead that [h] adds no duration of its own, but rather, is realized by aspiration superimposed on
[Figure 13.11 shows a schematic second formant track for [g h ai s], segmented in the traditional way into four chunks, with aspiration marked during the [h] portion.]

Figure 13.11 Second formant pattern for English [g h ai s].
the transition between the preceding and following segments, as illustrated in the following delta for big heist: (70)
syll:       ...  |                     syl                      |
nucleus:         |                nuc               |           |
CV:         ...|C|   C   |  V   |      |   C   |     |    C     |
phoneme:    ...|g|       |  a   |      |   i   |     |    s     |
aspiration:      |  asp  |
F2:          |   |       | 1400 |      | 2000  |     |   1800   |
asp_amp:         |  50   |
duration:    |   |  60   |  45  |  90  |  30   |  40 |    90    |
where asp_amp is a stream containing amplitude values for aspiration. Here [h] is represented as a C in the CV stream, with an associated asp token in an aspiration stream, but no associated phoneme token. Since the tokens in the phoneme stream have values for place and manner of articulation, among others, h in this representation is phonologically unspecified for these attributes. The absence of h in the phoneme stream simplifies the transition rules, which can refer directly to the sequence g a in determining the transition duration.12 (The C is necessary in the CV stream, since /h/ functions phonologically as a consonant, preventing flapping of a preceding /t/ or /d/, for example.) This model makes the prediction that the duration of the aspiration associated with h will be no greater than the duration of the transition between the preceding and following segments. However, recent studies I have conducted with Nick Clements have shown that for many speakers, [h] seems to consistently add a small amount of duration, on the order of 25 ms, to the transition. For such speakers, an additional rule would be required that adds a fixed duration for h to the transition.
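The timing consequence of this hypothesis is easy to state in a line of Python. The sketch below is illustrative only: the 60 ms transition is a made-up value, while the 25 ms increment is the figure reported just above for speakers whose [h] adds duration.

def aspirated_transition_dur(transition_dur, h_present, h_increment=25):
    """Duration of the stretch between the two flanking formant targets,
    under the hypothesis that [h] rides on the transition and, for some
    speakers, adds a fixed increment to it."""
    return transition_dur + (h_increment if h_present else 0)

print(aspirated_transition_dur(60, h_present=False))  # [g ai s]:   60 ms
print(aspirated_transition_dur(60, h_present=True))   # [g h ai s]: 85 ms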
[Figure 13.12 shows a schematic second formant track for [g ai s], with essentially the same F2 pattern as figure 13.11 but no aspirated stretch.]

Figure 13.12 Second formant pattern for English [g ai s].
An obvious question that arises is whether aspiration in stops might be represented in a way similar to that for h, as in the following delta for tie: (71)
syllable:   |                  syl                   |
nucleus:    |              |          nuc            |
CV:         |  C   |       |  V   |      |     C     |
phoneme:    |  t   |       |  a   |      |     i     |
aspiration: |      |  asp  |
F2:         | 1800 |       | 1400 |      |   2000    |
duration:   |  70  |  60   |  75  |  90  |    30     |
Note that the token asp in this case is not associated with a C or V in the CV stream, reflecting its different phonological status from that of h. Note also that the asp token is not associated with any phoneme, which completely eliminates the much-debated problem of whether aspiration of stops in English should be considered for segmentation purposes as part of the preceding stop or part of the following vowel.13 This model is undoubtedly oversimplified in several ways. For example, aspiration does not always coincide with the entire transition; sometimes it stops before the end of the transition and sometimes after. Also, transitions for different formants do not always reach their targets at the same time. The model could be extended to handle these cases by adding additional duration tokens at the appropriate points. For example, a transition that is only partially aspirated could be modeled as two duration tokens, the first synchronized with an asp token in the aspiration stream, and the second with a voicing token in a voicing stream (not shown above). Certainly the model has been oversimplified in other ways as well. The discussion has not been meant to give a full account of this model or even to be correct in all the details presented; the model is still very much under development
at the time of writing. Rather, the discussion is intended to show how models of this type, which rely heavily on synchronized phonological and phonetic units on different levels, can easily be accommodated in Delta, allowing linguists to explore the interface between phonology and phonetics.
13.6 Final remarks

This paper has presented selected features of the Delta programming language and two linguistic models formulated in that language, one for Bambara tone and fundamental frequency patterns and one for English formant patterns. The paper has focused on Delta's central data structure, which represents utterances as multiple streams of synchronized units. This data structure gives rule-writers considerable flexibility in expressing the relationship between phonological and phonetic units. The examples of programming statements in the paper have been meant to give the flavor of the Delta language, rather than to describe all possible Delta constructs. No complete programs were presented, other than the oversimplified program in figure 13.2. Unlike the program in that figure, Delta programs generally consist of several procedures. Variables in Delta can either be local to a particular procedure or global to all procedures. Delta has several kinds of statements not described in the paper. For example, in addition to the insert, delete, and merge statements for modifying deltas, Delta has a mark statement for changing the attributes of tokens and a project statement for projecting an existing sync mark into a new stream; and in addition to the forall rule, Delta has a while rule, which performs the action of the rule so long as the rule's test succeeds. Delta also has far more flexible input and output facilities than the simple read and print statements illustrated (including the ability to read and write deltas), and it has fully general numeric capabilities. Delta tests can also include many kinds of expressions not illustrated, including expressions that test for optional occurrences of a pattern in the delta, for one of a set of alternative patterns, and for a pattern that must not be present for the test to succeed. Delta tests can even be temporarily suspended in order to execute arbitrary statements, a capability that adds enormous power to the language. Delta also has many kinds of tests not illustrated at all, including procedure calls, which succeed or fail according to the form of the return statement in the called procedure. A particularly nice feature of the Delta language is its ability to group tests of different types into a single test - such as a delta test, a dictionary lookup, and a procedure call - making a single action dependent on the success of several kinds of tests. The interactive development environment has not been illustrated at all, but it is an essential part of the system that greatly enhances the system's utility as a rule-development tool. With its debugger, users can issue commands to temporarily
stop execution each time the delta changes, each time a particular variable changes, each time a particular line in the program is executed, each time a particular procedure is called, and so on. When the program stops, users can examine and modify the delta, program variables and other data structures. The interactive environment also gives users very flexible facilities for specifying the files that programs should read from and print to, and it lets users create logs of their terminal session and their program input and output. It also lets users interactively invoke procedures in their Delta program (for example, a user could invoke a procedure to synthesize on the basis of the current delta). A particularly interesting application of the ability to call procedures interactively would be the implementation of a well-formedness checker. Commands could be issued to the debugger that cause it to execute a particular procedure each time the delta or a particular stream in the delta changes. This procedure could test the delta to make sure that whatever manipulations have been made to it do not violate any well-formedness conditions of the language in question. Furthermore, the debugger could be instructed to print out any statement that causes such a condition to be violated, making it easy for rule-writers to find out exactly what parts of their program cause such constraints to be violated. Debugger commands can also be issued with an execcmd statement in a Delta program. A program might use an execcmd statement, for example, to invoke the well-formedness commands discussed above, so that the program automatically tests for any violations of well-formedness conditions whenever it is executed. In the future we plan to develop an interpreter for the Delta language, which would let users interactively modify and execute their programs without recompiling. Some time later, we also hope to enhance the Delta language itself. For example, future versions may let users construct and test arbitrary data structures, such as trees, arrays, and graphs. A user might, for example, choose to represent the tokens in a particular stream as trees of features, as in the multidimensional model of autosegmental phonology proposed by Clements (1985).14 Because the Delta System is still under development at the time of writing, little has been said about its performance or ease of use. We are encouraged, however, by our experience with an early version of the system. Students in speech synthesis classes felt comfortable writing programs with this version in anywhere from a few days to a few weeks. While large programs written with this version do not generally run in real-time, we have designed the current version of the system with real-time rule execution as a primary goal. An immediate plan of mine is to use the Delta System to further explore the model of English formant transitions hypothesized above. (It is possible that at the time of publication of this paper, the model will have been altered in significant ways.) Users elsewhere plan to use Delta to explore many other kinds of models,
including articulatory ones. Because of its ability to accommodate different models, the Delta System should help us increase our understanding of phonology and phonetics, and learn more about the interface between the two.
Notes

The Delta System is an Eloquent Technology product, developed in cooperation with the Department of Modern Languages and Linguistics and the Computer Science Department at Cornell University. Many people have contributed to the system. Kevin Karplus, Jim Kadin, and, more recently, David Lewis, have been instrumental in its design. Jim Kadin, David Lewis, and Bob Allen implemented Version 2 of the system. I thank Mary Beckman, Russ Charif, Nick Clements, John Drury, Greg Guy, Jim Kadin, Peggy Milliken, and Annie Rialland for their helpful suggestions on earlier drafts of this paper, and John McCarthy for his insightful comments on the version of this paper presented at the First Conference on Laboratory Phonology. These comments motivated several revisions in this paper. Scott Beck and Russ Charif patiently assisted in many phases of the final production of this paper.

1 We plan to make Version 2.0 of the Delta System available for the following computer systems: Sun Workstations (Unix), DEC VAXes (Unix and VMS), IBM PCs (DOS), Macintoshes and Apollos (Aegis and Unix).
2 Earlier versions of the system compiled Delta programs into pseudo-machine instructions. The current approach (compiling into C) sacrifices the extremely compact storage of rules achieved by the pseudo-machine approach in favor of much faster rule execution and flexibility in interfacing C routines.
3 A future version of Delta is planned that will allow tokens in the same stream to have different fields. This next version will also allow tokens to be represented as trees of features, as in the model of autosegmental phonology proposed by Clements (1985). See section 13.6, "Final remarks," for more details.
4 It is not only the name field that can be name-valued. For example, Hertz et al. (1985) gives an example of a text stream definition that defines a name-valued field called default_pronunc with the names of the tokens in the phoneme stream as possible values, and shows how this field might be used in a Delta program for English text-to-phoneme conversion to synchronize default pronunciations (phoneme tokens) with text characters.
5 Earlier papers about Delta showed initial deltas with a special GAP token in each stream (see section 13.3.4). Delta Version 2 lets rule-writers specify that a stream should be initialized to contain either a single specified token or nothing. "Nothing" is assumed by default.
6 In the examples of one-point insertions in earlier papers about Delta, the new sync mark bounding the inserted token was automatically projected into all streams in which the sync mark designated by the range pointer was defined. The system has been changed so that by default the new sync mark is only defined in the insertion stream. The duplicate option to the insert statement can be used to cause one-point insert statements to work the way they did in earlier versions of Delta, as illustrated in example (55).
7 User-defined tokens in non-time streams are indivisible (i.e., it is not possible to project sync marks into the middle of them). GAP tokens are always divisible.
8 In earlier papers about Delta, the sample action dictionaries were unnamed, reflecting the then-current version of the system, in which it was only possible to define a single,
unnamed action dictionary. Delta has been enhanced to allow rule-writers to use any number of named action dictionaries.
9 Bambara contains sets of morphs that differ phonetically only in their tone pattern. The dictionary action for such morphs might prompt the user for the intended tone pattern. Alternatively, the user might annotate the input string with the intended tone pattern, or, in some cases, the system might be able to perform some syntactic analysis to determine the correct pattern automatically.
10 Current research on downtrend in other languages suggests that this model may be oversimplified.
11 Since Delta has fully general numeric capabilities, it should be equally well-suited for expressing other algorithms for computing F0 values.
12 In all our SRS rules, not just those for English, some segments only have a single target for a particular formant, and others, like [h] in English and [r] in Japanese, have no targets for some formants. Non-initial English [h] is modeled by linear transitions connecting the last formant target in the preceding segment and the first in the following segment. The phonetic properties of [h] provide excellent evidence that /h/ must be treated as phonologically underspecified, a point made by Keating (1985).
13 It is interesting to note that Peterson and Lehiste (1960) found that when aspiration is considered to be part of the vowel, the vowel is lengthened by about 25 ms., the same duration increment we found for [h].
14 In the current system, rule-writers can use C to construct data structures of their choice. For example, they could construct tree-structured feature representations for tokens by using a numeric field in a token as a pointer to a tree structure defined in C.
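By way of illustration only, a minimal C sketch of the idea in note 14 might look as follows. This is not code from the Delta distribution; the type and function names here are invented for the example.

#include <stdint.h>

/* Hypothetical feature-tree node: a feature name plus its daughter features. */
typedef struct FeatureNode {
    const char *feature;              /* e.g. "place" or "labial" */
    struct FeatureNode **children;    /* daughter nodes, or NULL if a terminal feature */
    int n_children;
} FeatureNode;

/* Hypothetical token record: a numeric field doubles as a pointer to the tree. */
typedef struct {
    const char *name;                 /* token name, e.g. "p" */
    intptr_t feature_tree;            /* numeric field holding a FeatureNode address */
} Token;

/* Store a feature tree in the token's numeric field ... */
static void set_feature_tree(Token *t, FeatureNode *root) {
    t->feature_tree = (intptr_t)root;
}

/* ... and recover it later by casting the numeric field back to a pointer. */
static FeatureNode *get_feature_tree(const Token *t) {
    return (FeatureNode *)t->feature_tree;
}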
References
Beckman, M., S. Hertz and O. Fujimura. 1983. SRS pitch rules for Japanese. Working Papers of the Cornell Phonetics Laboratory 1: 1-16.
Bird, C. 1966. Aspects of Bambara syntax. Ph.D. dissertation, UCLA.
Bird, C., J. Hutchison and Mamadou Kante. 1977. An ka Bamankankan kalan: Introductory Bambara. Reproduced by Indiana University Linguistics Club.
Chomsky, N. and M. Halle. 1968. The Sound Pattern of English. New York: Harper and Row.
Clements, G. N. 1985. The geometry of phonological features. Phonology Yearbook 2: 225-252.
Clements, G. N. and S. J. Keyser. 1983. CV phonology: a generative theory of the syllable. Cambridge, MA: MIT Press.
Hertz, S. 1980. Multi-language speech synthesis: a search for synthesis universals (abstract). Journal of the Acoustical Society of America 67, suppl. 1.
1981. SRS text-to-phoneme rules: a three-level rule strategy. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 102-105.
1982. From text to speech with SRS. Journal of the Acoustical Society of America 72: 1155-1170.
1986. English text to speech rules with Delta. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2427-2430.
Hertz, S. and M. Beckman. 1983. A look at the SRS synthesis rules for Japanese. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1336-1339.
Hertz, S., J. Kadin, and K. Karplus. 1985. The Delta rule development system for speech synthesis from text. Proceedings of the IEEE 73, No. 11: 1589-1601.
Keating, P. 1985. Phonological patterns in coarticulation. Presented at the 60th Annual Meeting of the LSA, Seattle, Washington.
Mountford, K. 1983. Bambara declarative sentence intonation. Research in Phonetics, No. 3, Indiana University, Bloomington.
Peterson, G. E. and I. Lehiste. 1960. Duration of syllable nuclei in English. Journal of the Acoustical Society of America 32: 693-703.
Rialland, A. and M. Sangare. 1989. Réanalyse des tons du Bambara: des tons du nom à l'organisation générale du système. Studies in African Linguistics 20.1: 1-28.
14 The phonetics and phonology of aspects of assimilation JOHN J. OHALA
Some [scholars of language] ... have allowed themselves ... to be led astray by paying more attention to the symbols of sound than to sounds themselves. KEY 1852
...on paper almost everything is possible. OSTHOFF AND BRUGMANN 1878 [1967]
Real languages are not minimal redundancy codes invented by scholars fascinated by the powers of algebra, but social institutions serving fundamental needs of living people in a real world. [In trying to understand] how human beings communicate by means of language, it is impossible for us to discount physical considerations, [i.e.,] the facts of physics and physiology. HALLE 1954: 79-80. [Parts of this quote have been rearranged from the original without, I think, distorting the sense.]
14.1 Introduction
Assimilations of the type given in (1) are extremely common: when two stops of different place of articulation abut, the first (C1) assimilates totally to the second (C2).1

(1)  L. Latin   scriptu               >  Italian  scritto
                nocte                 >           notte
     Sanskrit   bhaktum               >  Pali     bhattum
                praptum               >           pattum
                labdha                >           laddha
     Old Irish  fret- (frith-) + cor  >  freccor ~ frecur
                *ad-gladam            >  ac(c)aldam
                ad + bongid           >  apaig          (Thurneysen 1961)
Even more common are cases where a nasal assimilates to the place of articulation of a following stop, as exemplified in (2).

(2)  L. Latin  primu tempus  >  French      printemps
               amita         >  Old French  ante (Mod. French tante)
     Shona     N + tuta      >  *nthuta > nhuta
               N + bato      >  mɓato        (Doke 1931)
There is no particular difficulty in representing such variation, whether linear or nonlinear notation is used, as in (3a) and (3b) (where the assimilation of nasals to stops is given).

(3)  a.  [nasal] → [α place] / ___ [stop, α place]

     b.  +nasal  -nasal          +nasal  -nasal
           |       |               |       |
           C       C       >       C       C
           |       |                \     /
        β place  α place            α place
Nevertheless, there is something profoundly unsatisfactory in these representations. As noted by Chomsky and Halle (1968: 400ff), there is nothing in them which would show that the reverse process, whereby a stop C2 assimilates to the place of articulation of stop or nasal C1, i.e. as represented in (4a) and (4b), is rarely found and is quite unnatural.

(4)  a.  [stop] → [α place] / [nasal, α place] ___

     b.  +nasal  -nasal          +nasal  -nasal
           |       |               |       |
           C       C       >       C       C
           |       |                \     /
        α place  β place            α place
Chomsky and Halle's solution to this was to invoke a marking convention which, when linked to an assimilation rule, would provide the "natural" or unmarked feature values. However, marking conventions are clearly no solution at all; they are just a patch for a defective notation system, like an added epicycle in the Ptolemaic model of planetary orbits. If marking conventions could solve all defects of notation, then there would never be any motivation to adopt newer, "improved" notations - and yet this has occurred frequently in the past 30 years: feature notation in the early 60s, articulatory features in the 60s, autosegmental
notation in the mid 70s, not to mention elaborations such as mirror image rules, alpha variables, etc. It is evident that phonologists want their notations to be revealing, to explain or render self-evident the behavior of speech sounds and to do this by incorporating some degree of isomorphism between the elements of the representation and the phenomena it stands for.
14.2 Past work relevant to assimilation

If we want to represent the processes in (1) and (2) we must first understand them. The most common explanation given for them is that they come about due to "ease of articulation," i.e. that the speaker opts for an articulation that is easier or simpler than the original (cf. Zipf 1935: 96-97). Unfortunately, the notion of "ease" or "simplicity" has never been satisfactorily defined. It is true that when a heterorganic cluster becomes a geminate (and necessarily homorganic) or when a nasal assimilates in place to the following consonant, there is one less articulator involved, but it does not follow so straightforwardly that this yields an easier task. No one knows how to quantify articulatory effort, but certainly the neurological control operations should be counted, too, not just the energy required to move the speech organs. For all we know it may "cost" more to have the velum execute a closing gesture in the middle of a consonantal closure (as in Shona mɓato, above) rather than to synchronize this gesture with the offset of one segment and onset of another. Likewise, it may very well cost more to hold a consonantal closure for an "extra" amount of time - as required in geminates - rather than to give it a more normal duration. Finally, and this is the crucial defect, the notion of ease of articulation fails to explain why, in the above cases, it is typically C1 which assimilates to C2 and not vice-versa. A priori, it seems more plausible that if degree of effort really mattered, C1 is the consonant that should prevail in these assimilations, i.e. after supposedly "lazy" speakers adopt a given articulatory posture, one would expect them to maintain it during C2. That the opposite happens is sufficient reason to be highly suspicious of such accounts. There are also accounts of assimilation that do not rely on the notion of ease of articulation, e.g. Kent's (1936) "the speaker's thoughts are inevitably somewhat ahead of his actual utterance," but their relevance to cases of assimilation has not been demonstrated. Kent's basic notion is not implausible: speech errors do exhibit anticipation of sounds, e.g. teep a cape < keep a tape; but the character of such speech errors does not resemble in detail what one finds in (1) and (2). Often the anticipated sound is itself replaced by the sound it supplants and, moreover, the anticipated sound almost invariably is one occupying a similar position in another syllable, i.e. in onset, nucleus, or coda position. Also, it is universally accepted that all articulations (all voluntary movement, in fact) must be preceded by the "thoughts" that control them, but how, exactly, does "thought" get transduced into movement? It is not true that by just thinking or intending to say
an utterance we actually say it. So Kent's account is not a sufficient explanation for anticipatory assimilation. There is reason to believe, though, that such assimilations owe very little to the speaker for their initiation (cf. Ohala 1974a, 1974b). Malecot (1958) and Wang (1959) showed that final heterorganic stop clusters created by tape splicing, e.g. /epk/ from joining /ep/ (minus the stop release) to the release of the final stop of the syllable /ek/, are identified overwhelmingly as single consonants having the place of articulation of C2. Malecot concluded:
    Voiceless p t k releases and voiced b d g releases contain sufficient cues for conveying both place and manner of articulation of American English plosives in final position. These cues are powerful enough, in most instances, to override all other place and manner cues present in the vowel-plus-closing-transitions segment of those plosives... (380)
In addition, there is abundant evidence that the place of unreleased final stops - i.e. where only the stop onset cues are present - is frequently misidentified (Householder 1956), suggesting that place cues are relatively less salient in this environment. In contrast, place cues for stops in pre-vocalic position are generally very strong, so much so that just the burst is generally sufficient to cue place (Winitz, Scheib, and Reeds 1972). Place cues are even weaker in the case of nasals, which, although as a class are highly distinct from non-nasals, are often confused among themselves (House 1957; Malecot 1956); thus when joined to a following stop it is not surprising that the listener has relatively less trouble hearing the nasal consonant as such but takes the place cue from the more salient stop release. The relative perceptual value of VC vs. CV transitions for intervocalic stops has been investigated in numerous studies (Repp 1976, 1977a, 1977b, 1978; Fujimura, Macchi, and Streeter 1978; Dorman, Raphael, and Liberman 1979; Streeter and Nigro 1979). The results of these studies show consistently (and, in the case of Fujimura et al., cross-linguistically) that when spliced VC and CV transitions differ, e.g. /eb/ spliced onto /de/ with a gap equal to that typical of a single stop, listeners generally "hear" only the consonant cued by the CV transitions, that is, /eb + de/ is heard as /ede/. The cause of this effect is still debated. One view is that recent cues dominate over earlier cues. Another view (e.g. Malecot's, quoted above) is that the VC and CV transitions have inherently different quality of place cues: especially in the case of stops, the VC cues reside almost totally in the formant transitions whereas the CV cues include transitions and the stop burst, the latter of which has been demonstrated to carry more reliable information for place than transitions alone (Schouten and Pols 1983). Against this view, however, is the evidence of Fujimura et al. that when the mismatched VC-CV utterances are played backwards, it is still the CV cues, that is, those originally from the VC portion, that dominate the
percept. Moreover, there were slight but significant differences in the reactions of native speakers of English and of Japanese to these stimuli, which were attributed to the differing syllable structures of the two languages. These arguments support the notion that in addition to any physical differences between VC and CV cues, listeners' experience, including their native language background, dictates which cues they pay most attention to. These three hypotheses do not conflict; they could all be right. In order to explore further the influence of the listeners' prior experience on the interpretation of mismatched VC1+C2V utterances, the following two studies were done. These experiments were done as class projects by students in one of my graduate seminars.2
14.3 Experiment 1

14.3.1 Method

An adult male native speaker of American English (from Southern California) recorded VCV and VC1C2V utterances where V was always [a] and the single C was any of the six stops of English, /p, t, k, b, d, g/, and the clusters consisted of the same six stops in C2 position and a homorganic nasal in C1 position, e.g. apa, ata, aba, anta, amba, etc. In addition, stress was placed in one reading on the initial vowel and, in a second reading, on the final vowel. The list is given in (5).

(5)  'apa    'ata    'aka       'aba    'ada    'aga
     'ampa   'anta   'aŋka      'amba   'anda   'aŋga
     a'pa    a'ta    a'ka       a'ba    a'da    a'ga
     am'pa   an'ta   aŋ'ka      am'ba   an'da   aŋ'ga
The recording was done in a sound-treated room using high-quality recording equipment. These utterances were filtered at 5 kHz and digitized at a 10 kHz sampling rate and subjected to splicing such that within each of the eight groups in (5) the initial VC was spliced onto the final CV. Where there is only one intervocalic C, VC is the interval from initial vowel onset to the middle of the consonant closure and CV is the remainder. Thus the first group yielded the spliced utterances in (6).

(6)  'ap-pa  'ap-ta  'ap-ka
     'at-pa  'at-ta  'at-ka
     'ak-pa  'ak-ta  'ak-ka

In the case of the homorganic nasal + stop clusters, the cut was also made at the middle of the nasal + stop closure. Thus 72 stimuli were created.
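In effect, each spliced token is just the first portion of one digitized utterance concatenated with the final portion of another. The C sketch below illustrates this; it is my own illustration under stated assumptions (the closure midpoints are assumed to have been located already), not the software actually used for the experiment, and the function and variable names are invented.

#include <stdlib.h>
#include <string.h>

/* Concatenate samples [0, cut1) of token 1 with samples [cut2, len2) of token 2.
   Returns a newly allocated buffer of samples; *out_len receives its length. */
short *cross_splice(const short *tok1, size_t cut1,
                    const short *tok2, size_t cut2, size_t len2,
                    size_t *out_len) {
    size_t tail = len2 - cut2;
    short *out = malloc((cut1 + tail) * sizeof *out);
    if (out == NULL) return NULL;
    memcpy(out, tok1, cut1 * sizeof *out);               /* VC portion, e.g. /'ap-/ */
    memcpy(out + cut1, tok2 + cut2, tail * sizeof *out); /* CV portion, e.g. /-ka/  */
    *out_len = cut1 + tail;
    return out;
}

/* Example use (names hypothetical): cross_splice(apa, apa_mid, aka, aka_mid,
   aka_len, &n) would yield the stimulus 'ap-ka. */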
These spliced tokens were copied, randomized, onto an audio tape with a 250 ms 1000 Hz sine wave "warning signal" 500 ms. before each stimulus and 4*6 seconds between each stimulus (with extra time between every sixth stimulus). Six tokens randomly chosen were also prefixed to the 72 test stimuli to serve as examples which would familiarize listeners with the type of stimuli they would be hearing. The entire tape, examples plus test stimuli, lasted about 9 minutes. Twelve volunteer listeners, native speakers of American English, were recruited from an elementary linguistics course at the University of California, Berkeley, and were presented with an answer sheet and instructions which indicated that we were evaluating the intelligibility of synthetic speech and wanted them to identify some isolated utterances. The answer sheet gave three choices for each stimulus, one which represented the medial consonant or sequence as having the place of C1, another C2, and a third as "other." For example, for the stimulus /ap-ka/ the possible answers were "apa, aka, other;" for the stimulus /an-pa/ the choices were "anta, ampa, other," etc.

14.3.2 Results

Of the 576 responses (those where the place of articulation of C1 ≠ C2), 93% were such as to show that the place of articulation of C2 dominated the percept. Differences in stress placement did not significantly influence the proportion of C2 responses nor was there any significant difference between stop + stop clusters and nasal + stop clusters. However, there was a significantly lower proportion of C2 responses in the case of voiced clusters (89%) vis-a-vis voiceless clusters (97%).
14.4 Experiment 2

14.4.1 Method

For experiment 2 the same VC1+C2V utterances were used except that tokens where C1 = C2 were eliminated, as were all tokens where C1 = nasal. The rest were modified by incrementing and decrementing the closure interval in 10 ms. chunks: 120 to 170 ms. for the voiceless clusters and 70 to 140 ms. for the voiced clusters. In the voiced series some cuts did not coincide with zero crossings but since the amplitude of the signal was so low this did not introduce any noticeable discontinuities into the stimuli. These tokens were randomized and presented to 18 listeners (young adult native speakers of English) in a way similar to that of experiment 1, except that the answer sheet gave the options of VC1V, VC2V, and VC1C2V.

14.4.2 Results

Figure 14.1 presents graphically the pattern of listeners' responses in terms of the percentage of -CC- judgements (vertical axis) as a function of the closure duration (horizontal axis); the solid line gives the results for the voiced series and the dashed line, the voiceless series.
[Figure 14.1 Listeners' responses to stimuli in experiment 2 (see text). Ordinate: percent of 18 listeners' pooled responses identifying the stimulus as -C1C2-; abscissa: duration of stop closure (ms). Solid line: voiced C's; dashed line: voiceless C's (curves fitted by eye).]
These data confirm previously published results: when the gap between the VC and CV transitions is short, listeners report only a single consonant, whereas when the interval is longer, more clusters are reported. However, the results extend earlier findings by showing that the identification functions are different for the voiced and voiceless clusters: voiced stops were heard as clusters at shorter durations (down to 95 ms.) than were voiceless stops (only down to 150 ms.). One may speculate plausibly that this difference is a reflection of the intrinsic difference between the durations of voiced and voiceless intervocalic consonants: single voiced stops are typically shorter than single voiceless stops (see Westbury 1979: 98 for a review);3 in comparison with voiceless stop closures, listeners therefore must hear a shorter voiced stop closure before they relinquish the -CC- percept and embrace -C-. These results suggest that it is experience with the natural structure of speech which guides the listener in evaluating and integrating the cues present in speech:
since there is a richer, more reliable set of place cues in the CV transition than the VC transition, listeners weight the former more heavily than the latter in deciding what they've heard. Since intervocalic clusters have a longer closure than single consonants, a longer closure duration is necessary for listeners to hear clusters. Since voiceless stops have longer closure durations than voiced stops, a shorter closure duration is necessary for the latter to be heard. This interpretation is compatible with all the experimental results reviewed earlier and supports the interpretation that the lack of salience of the VC transitions is mediated by linguistic experience and is not - or is not simply - a general psychological constraint. This interpretation also suggests that the sound changes shown above in (1) and (2) could have occurred due to less experienced listeners lacking the perceptual ability to integrate the weaker place cues in the VC transitions.
14.5 Discussion
These results have implications for a number of points.

14.5.1 A general theory of assimilation
The first experiment reported here and the earlier ones reviewed have, in effect, reproduced in the laboratory one aspect of the sound changes responsible for the data in (1) and (2) and can, therefore, be said to have (partially) explained it (see also Ohala 1987). They have shown that its source is to be found in the acoustic-auditory domain, not in the articulatory. The source of variation - change in pronunciation - happened not in the mouth of the speaker who uttered the test tokens in the experiment but in the ears of the listeners. Of course, the factors which give rise to the acoustic-auditory factors behind this asymmetry in direction of assimilation are ultimately articulatory and we can speculate fairly confidently about their nature. During a stop closure there is, by its very nature, a continuous increase in the air pressure behind the constriction. Thus at implosion there is low pressure but at release there is high pressure and consequently a high rate of airflow past the constriction. This high airflow creates audible turbulence, the burst. The burst has been shown to be a more reliable and robust cue to place of articulation (since the spectrum of the noise generated is largely determined by the resonating cavity forward of the constriction) than the formant transitions which occur as the articulators move towards or away from a constriction, where the pattern is determined as much by the cavity in back of the constriction as in front and is thus subject to more overlaid and thus obscuring influences (Ohman 1966; see also Ohala and Kawasaki 1984, and references therein). In the case of nasal + stop clusters we again have the highly reliable stop burst plus formant transitions cuing place vs. the less reliable place cues in the nasal and its formant transitions, thus leading to the stop dominating the cues for place.
I do not claim that all assimilations would be subject to the same principles. In fact I believe that many of the phenomena labeled "assimilation" are likely to be governed by very different principles (see note 1). Although I am not prepared to defend these beliefs here, I think that voicing assimilation will exhibit very different tendencies (it would probably have a greater incidence of perseverative assimilation of voicelessness), as will assimilations that, unlike those in (1) and (2) above, involve only one articulator, e.g. velum, tongue tip, lips.

14.5.2 Sound change in general
These results add to the growing body of evidence pointing to the crucial role of the listener in initiating certain sound changes (Jonasson 1971; Ohala 1981, 1985a). This is not to deny that much of the synchronic variation in speech - from which diachronic variation arises - can be traced to the speaker or the physical principles which map articulation to sound (see e.g. Weymouth 1856; Ohala 1979, 1983a; Goldstein 1983); nevertheless, the role of the speaker has been greatly overemphasized in previous speculation on this point. Furthermore, these results reinforce a non-teleological view of sound change, that is, that neither speaker nor hearer chooses - consciously or not - to change pronunciation (Ohala 1975, 1985a; Hombert, Ohala, and Ewan 1979). Rather, variation occurs due to "innocent" misapprehensions about the interpretation of the speech signal or, as suggested above, due to listeners' inexperience. In this respect sound change is not unlike the transmission of scribal errors in the copying of manuscripts. It does not occur to "optimize" speech in any way: it does not make it easier to pronounce, easier to detect, or easier to learn. I acknowledge that this is a complex issue and that anyone holding an opposite view would not likely be convinced by these few remarks. Perhaps some of the works cited in the references at the end of this paper would serve that purpose.

14.5.3 Representation of sound patterns in a way that facilitates their explanation
How would one go about representing these factors in a way which would make them self-evident, that is, to fall out naturally from the representation and not to require propping up by external declarations like markedness (or other) conventions? The answer, I maintain, is models which incorporate known aerodynamic principles (Rothenberg 1968; Stevens 1971; Müller and Brown 1980; Ohala 1975a, 1976, 1983b; Keating 1984), known principles relating vocal tract shape and acoustic output (Fant 1960; Stevens 1972), and some of the principles (to the extent that we know them) of how our auditory system extracts information from the acoustic signal (e.g. see Bladon and Lindblom 1981; Bladon 1986; Lindblom 1986). I will refer to these as "phonetic" models or representations.4 There has even been considerable success in developing models
which incorporate two or more of these links in the speech chain (Flanagan, Ishizaka, and Shipley 1975; Lindblom 1986). It is possible to "wind up" such models and make them "go" and see natural speech sound behaviour happen. For the sake of explaining natural sound patterns there are advantages to representations using phonetic primitives - advantages not found in other currently popular phonological representations. One advantage is that a few well-chosen primitives go a long way. The basic anatomical, aerodynamic, acoustic, and perceptual constraints of speech can be and have been invoked to explain many quite specific forms of speech sound behavior. Table 14.1 lists just a few of these. This partial list constitutes by itself a "critical mass" of successful explanatory studies of sound patterns (even allowing that some may require revision) such that it would deserve serious attention. That they are all based on the same minimum machinery and are experimentally supported makes them doubly worthy of attention. In contrast, with currently fashionable phonological representations, each new fact considered as often as not requires ad hoc patches to the existing framework, e.g. "geminate integrity," "inalterability," "the obligatory contour principle," "the shared feature convention" (see Kingston, this volume). These patches must be established by decree; they do not "fall out" from the primitives assumed for the basic nonlinear mechanisms. How impressive would it have been if Newton, after presenting his theory which unifies free fall of terrestrial objects and planetary orbits, had added "and, oh yes, on top of this we also have to recognize that tides exist and that they are related somehow to the relative positions of the sun and moon"? A second advantage - and this follows from the first, just mentioned - is that none of the terms of the explanation are unfamiliar, other-worldly entities. If Boyle-Mariotte's Law has to be invoked to explain the devoicing of stops, it is the same Boyle-Mariotte's Law that applies in other parts of the familiar universe we live in (bicycle pumps, automobile pistons, party balloons, barometers, etc.). If the principle of "camouflage" is invoked in the auditory domain to explain dissimilation (Ohala 1981), it is the same camouflage that applies in the visual domain. In contrast, currently popular phonological representations "explain" sound patterns by conjuring up a vast array of devices and conventions that seem to apply exclusively to speech.5 In sum, phonetic accounts of natural sound patterns adhere to the constraint of Occam's razor, expressed by Newton as "more is in vain when less will serve." Physical and physiological representations may represent unfamiliar territory for most phonologists. Nevertheless, the one who asks the question presumably is responsible for providing the answer. It cannot be the case that inferior answers to questions are accorded any status in science because the asker shows no interest or ability in the domain where the answers lie. The answer to questions such as why C2 is favored in C1C2 assimilations of stops is to be found in these phonetic domains - not in spider-web networks of phonetic labels such as one finds in
Table 14.1
1. Treatments of sound patterns due to production constraints:
   a. Devoicing of obstruents (Ohala 1983b), especially those with back articulations (Javkin 1977) and especially those with long closure duration;
   b. (Af)frication of stops before high close vowels and glides (Ohala 1983b);
   c. Devoicing and/or frication of high close vowels or glides (Ohala 1983b);
   d. Nasalization inhibits devoicing and/or frication of vowels and glides (Ohala 1978a, 1983b);
   e. Segments which do and do not block spreading nasalization (Ohala 1975b, 1983b);
   f. Stop epenthesis (stop preservation) in nasal + oral sequences, [s] + [l] (and vice versa) (Weymouth 1856; Phelps 1937; Ohala 1974b).
2. Treatments of sound patterns by reference to acoustic-auditory constraints:
   a. Labiovelars behave like labials when interacting with fricatives but behave like velars when interacting with nasals and coloring the quality of adjacent vowels (Ohala and Lorentz 1977; Ohala 1979);
   b. Palatalized labials and velars change to apicals (Ohala 1978b, 1983a, 1985a);
   c. Labialized apicals and velars change to labials (Durand 1955);
   d. Cross-language prohibitions against labials + w, apicals + l, and apicals and palatals + j (Kawasaki 1982; Ohala and Kawasaki 1984);
   e. The above changes (and many others) are asymmetrical in their directionality (Ohala 1983a, 1985a);
   f. An account of dissimilation which explains:
      i. what features do not assimilate,
      ii. why dissimilation tends not to introduce new segments to a language whereas assimilation may do so,
      iii. why dissimilation requires the conditioning environment to remain in the process of the change, whereas assimilation does not (Ohala 1981, 1985a);
   g. Why nasalization tends not to be distinctive near nasal consonants (Kawasaki 1986);
   h. Spontaneous nasalization (Ohala 1983a);
   i. How nasalization affects vowel quality (Wright 1986; Beddor, Krakow, and Goldstein 1986).
autosegmental notation. Drawing a line between two primitive entities, however valid they might be as primitives, e.g. "obstruent," "labial," and "oral," or simply grouping them together inside square brackets, will not show how this combination will create an increase in oral pressure which, when released, gives rise to a rich set of place cues. Somewhere in the representation there will have to appear equations such as that in (7),6 (7)
air pressure = air mass × (1/volume) × constant,
reflecting that the more air one stuffs into a cavity, the more pressure rises (see Ohala 1976, 1983b).
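For concreteness, the relation in (7) is simply the gas law at constant temperature. The LaTeX sketch below is my own illustration (including the made-up numerical example in the comments), not part of the original chapter.

% Boyle-Mariotte / ideal gas law at constant temperature T:
%   P V = n R T, so pressure P grows with the amount of air n trapped behind
%   the closure and falls as the cavity volume V grows.
% Illustrative (hypothetical) figure: if glottal airflow adds 10% to n while
% V stays fixed, oral pressure P rises by the same 10%.
\[
  PV = nRT \quad\Longrightarrow\quad P = \frac{nRT}{V} \;\propto\; \frac{n}{V}
  \qquad (T \text{ constant})
\]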
Table 14.2 Comparison of physical phonetic vs. autosegmental representations of sound patterns

Goal: Explaining 'naturalness'
   Physical phonetic: Excellent. Autosegmental: Poor.
Goal: Reflecting history of languages
   Physical phonetic: Excellent for presumed phonetic initiation; not for subsequent transmission. Autosegmental: Undemonstrated.
   Comments (on this goal and the preceding one): These two are related.
Goal: Taxonomic descriptive
   Physical phonetic: Poor to Fair; too cumbersome. Autosegmental: Possibly very good; especially when a 'melody' needs to be abstracted from words manifesting it.
   Comments: There is no measure of success in description.
Goal: Reflecting psychological structure
   Physical phonetic: Makes no claim to this (but see Ohala 1986). Autosegmental: Largely unproven except for certain speech error and word game data.
   Comments: This has nothing to do with naturalness.
Goal: Representing phonology in a pedagogically effective way
   Physical phonetic: Makes no claim to this. Autosegmental: Makes no claim to this.
Goal: Representing phonology in a computationally efficient way (e.g. for word parsing)
   Physical phonetic: Makes no claim to this. Autosegmental: Makes no claim to this.
Granted, it would be possible in principle to represent this fact through an ad hoc rule such as (8).

(8)  Ø → burst / [obstruent, etc.] ___
But, as discussed above, we would soon find ourselves having to add more such rules - " epicycles"- each time a new consequence of the interactions of parameters was discovered. I do not regard autosegmental notation as useless for all purposes nor do I think phonetic representations are suited for all tasks which are legitimate concerns of the phonologist. It may be helpful if I present this evaluation in the form of the scorecard given in table 14.2. I presume there is no need to elaborate further on the reasons why I think that in comparison with phonetic representations autosegmental notation fails to explain the naturalness of common sound patterns. A closely related task, of course, is to give an account of the history of phonological events which gave rise to current sound patterns. Naturally, phonetic representations would excel at representing that stage at which phonetic factors played a role: the initiation of sound change; they would probably be of relatively little value in explaining the transmission of sound change. 269
Nevertheless, there is a sense - perhaps an unintended sense - in which traditional, supposedly synchronic, phonological accounts are actually good descriptions (but still not explanations) of the historical sequences leading up to various sound patterns. This is so because in spite of modern phonology's goal of discovering the "mental processes" underlying language (Chomsky and Halle 1968: viii), the chief method used is still that of internal reconstruction, which is appropriate for historical but not psychological study. What autosegmental notation seems to be good at and for which physical phonetic representation is less suited is describing sound patterns and therefore classifying them, especially when certain phonological "melodies" need to be abstracted out of the forms which manifest them, e.g. tone and other prosodic properties of words such as quantity, Semitic-like separations of vowels and consonants in inflectional paradigms. These are the kind of phenomena autosegmental notation was originally developed to serve (Mattingly 1966; Goldsmith 1976). (I am not convinced that other "prosodies" such as vowel harmony or spreading nasalization benefit from an autosegmental description as opposed to, say, standard linear notation. And one of the more popular uses of autosegmental notation - that of being able to posit abstract segments without having to commit oneself to an arbitrary declaration about their phonetic character - strikes me as a clever solution to a pseudo-problem, or more accurately, a problem necessitated by arbitrary constraints on the form of linguistic description.7) But if nonlinear representation is useful as a descriptive device - that is, simply as a notation - it must be remembered that there are no absolute criteria for evaluating descriptions. Different modes of description may be suitable for different purposes. A hawk may be variously described as a vertebrate, a warm-blooded creature, a predator, a creature found at the top of the food chain, one which has altricial (as opposed to precocious) young, etc. These descriptions are not mutually exclusive; they are not right or wrong; all have some usefulness; it makes little sense to try to "prove" that a hawk is a "predator." These are informal descriptors. We get a poor return (in terms of greater understanding) from any efforts to make them formal. As for representing the psychological structure of speech, physical phonetic representation makes no claim to this and the value of autosegmental notation for this goal has scarcely begun to be seriously investigated. The mere possibility of representing speech in a certain way - even in an allegedly economical way (but who is keeping the accounts?) - is insufficient evidence of the psychological reality of that representation. As for the other goals listed in table 14.2, neither representation has made any claim to meeting them. I list them simply to make the point that there are many jobs the phonologist takes on and a given representation may not be suitable for all of them. Ultimately we seek the best tool for each job. It is no disparagement of a given tool to point out that it is unsuitable for a certain purpose. Equally,
however, there is no advantage to "selling" a tool for a function to which it is ill-adapted.

Notes
Pat Keating, Ilse Lehiste, Manjari Ohala, Janet Pierrehumbert, and Rich Rhodes provided helpful comments on earlier versions of this paper, for which I thank them.
1 This pattern is statistically frequent but not exceptionless; Tauli (1956) reports the following stop + stop assimilations in Estonian dialects:
      pikk (cf. Finnish pitkä)
      kak'ki ~ katki
      'settember < September
   but:
      retseppe ~ retsepte
   where in the last case C2 assimilates to C1. I am not concerned in this paper with assimilations between segments differing in manner and/or voicing. Nevertheless, it is worth noting that Murray (1982) has argued convincingly that cases such as Sanskrit to Pali patra- > patta- are not cases of assimilation of C2 to C1 but rather of gemination of C1 and subsequent loss of (original) C2.
2 The first one, a replication of the kind of studies just cited, was done by all members of the seminar: Eugene Buckley, Jeff Chan, Elizabeth Chien, Clarke Cooper, David Costa, Amy Dolcourt, Hana Filip, Laura Michaelis, Richard Shapiro, and Charles Wooters, and is cited here as Buckley et al. (1987). The second experiment, which explored the influence of varying time intervals between the spliced segments, was done by Jeff Chan and Amy Dolcourt and is cited here as Chan and Dolcourt (1987).
3 Curiously, the duration of inter-vocalic voiced stop clusters is virtually the same as that for voiceless stop clusters, about 185 ms. in post-stress environment (Westbury 1979: 75ff).
4 Some "natural" sound patterns, e.g. dissimilation, may require for their explanation reference to phonetic and cognitive principles.
5 Cf. Isaac Newton's rejection, in his Optics, of "occult qualities" to explain natural phenomena.
6 To a physicist such equations have all the faults I have attributed to the notations in (3): they are arbitrary and do not show in a self-explanatory way why they may not be expressed differently. The physicist demands - and does possess - a yet more primitive model from which such equations may be derived. These equations nevertheless serve the phonologist as adequate primitives because they represent reliable general principles from which the target phenomena may be derived and yet which have been established independently of those phenomena.
7 The situation is not unlike that in Christian philosophy which has come up with clever explanations for the existence of evil in spite of the initial assumptions that God was omnipotent and beneficent.
References
Beddor, P. S., R. A. Krakow, and L. M. Goldstein. 1986. Perceptual constraints and phonological change: a study of nasal vowel height. Phonology Yearbook 3: 197-217.
Bladon, R. A. W. 1986. Phonetics for hearers. In G. McGregor (ed.) Language for Hearers. Oxford: Pergamon, 1-24.
Bladon, R. A. W. and B. Lindblom. 1981. Modeling the judgment of vowel quality differences. Journal of the Acoustical Society of America 69: 1414-1422.
Buckley, E. et al. 1987. Perception of place of articulation in fused medial consonant clusters. MS, University of California, Berkeley.
Chan, J. and A. Dolcourt. 1987. Investigation of temporal factors in the perception of intervocalic heterorganic stop clusters. MS, University of California, Berkeley.
Chomsky, N. and M. Halle. 1968. The Sound Pattern of English. New York: Harper and Row.
Doke, C. M. 1931. A Comparative Study in Shona Phonetics. Johannesburg: The University of the Witwatersrand Press.
Dorman, M. F., L. J. Raphael, and A. M. Liberman. 1979. Some experiments on the sound of silence in phonetic perception. Journal of the Acoustical Society of America 65: 1518-1532.
Durand, M. 1955. Du rôle de l'auditeur dans la formation des sons du langage. Journal de psychologie 52: 347-355.
Fant, G. 1960. Acoustic Theory of Speech Production. The Hague: Mouton.
Flanagan, J. L., K. Ishizaka and K. L. Shipley. 1975. Synthesis of speech from a dynamic model of the vocal cords and vocal tract. Bell System Technical Journal 54: 485-506.
Fujimura, O., M. J. Macchi, and L. A. Streeter. 1978. Perception of stop consonants with conflicting transitional cues: a cross-linguistic study. Language and Speech 21: 337-346.
Goldsmith, J. A. 1976. Autosegmental Phonology. Bloomington: Indiana University Linguistics Club.
Goldstein, L. 1983. Vowel shifts and articulatory-acoustic relations. In A. Cohen and M. P. R. v. d. Broecke (eds.) Abstracts of the Tenth International Congress of Phonetic Sciences. Dordrecht: Foris, 267-273.
Halle, M. 1954. Why and how do we study the sounds of speech? In H. J. Mueller (ed.) Report of the 5th Annual Round Table Meeting on Linguistics and Language Teaching. Washington DC: Georgetown University, 73-83.
Hombert, J.-M., J. J. Ohala, and W. G. Ewan. 1979. Phonetic explanations for the development of tones. Language 55: 37-58.
Householder, F. W. 1956. Unreleased /ptk/ in American English. In M. Halle (ed.) For Roman Jakobson. The Hague: Mouton, 235-244.
House, A. S. 1957. Analog studies of nasal consonants. Journal of Speech and Hearing Disorders 22: 190-204.
Javkin, H. R. 1977. Towards a phonetic explanation for universal preferences in implosives and ejectives. Proceedings of the Annual Meeting of the Berkeley Linguistic Society 3: 559-565.
Jonasson, J. 1971. Perceptual similarity and articulatory re-interpretation as a source of phonological innovation. Quarterly Progress and Status Report, Speech Transmission Laboratory, Stockholm, 1/1971: 30-41.
Kawasaki, H. 1982. An acoustical basis for universal constraints on sound sequences. Ph.D. dissertation, University of California, Berkeley.
1986. Phonetic explanation for phonological universals: the case of distinctive vowel nasalization. In J. J. Ohala and J. J. Jaeger (eds.) Experimental Phonology. Orlando, FL: Academic Press, 81-103.
Keating, P. 1984. Aerodynamic modeling at UCLA. UCLA Working Papers in Phonetics 59: 18-28.
Kent, R. 1936. Assimilation and dissimilation. Language 12: 245-258.
Key, T. H. 1852. On vowel-assimilations, especially in relation to Professor Willis's experiment on vowel-sounds. Transactions of the Philological Society 5: 192-204.
Kohler, K. J. 1984. Phonetic explanation in phonology: the feature fortis/lenis. Phonetica 41: 150-174.
Lindblom, B. 1986. Phonetic universals in vowel systems. In J. J. Ohala and J. J. Jaeger (eds.) Experimental Phonology. Orlando, FL: Academic Press, 13-44.
Malecot, A. 1956. Acoustic cues for nasal consonants: an experimental study involving tape-splicing technique. Language 32: 274-284.
1958. The role of releases in the identification of released final stops. Language 34: 370-380.
Mattingly, I. G. 1966. Synthesis by rule of prosodic features. Language and Speech 9: 1-13.
Müller, E. M. and W. S. Brown Jr. 1980. Variations in the supraglottal air pressure waveform and their articulatory interpretations. In N. J. Lass (ed.) Speech and Language: Advances in Basic Research and Practice. Vol. 4. New York: Academic Press, 317-389.
Murray, R. W. 1982. Consonant cluster developments in Pali. Folia Linguistica Historica 3: 163-184.
Ohala, J. J. 1974a. Phonetic explanation in phonology. In A. Bruck, R. A. Fox, and M. W. LaGaly (eds.) Papers from the Parasession on Natural Phonology. Chicago: Chicago Linguistic Society, 251-274.
1974b. Experimental historical phonology. In J. M. Anderson and C. Jones (eds.) Historical Linguistics II: Theory and Description in Phonology. Amsterdam: North Holland, 353-389.
1975a. A mathematical model of speech aerodynamics. In G. Fant (ed.) Speech Communication. [Proceedings of the Speech Communication Seminar, Stockholm, 1-3 Aug. 1974.] Vol. 2: Speech Production and Synthesis by Rule. Stockholm: Almqvist & Wiksell, 65-72.
1975b. Phonetic explanations for nasal sound patterns. In C. A. Ferguson, L. M. Hyman, and J. J. Ohala (eds.) Nasalfest: Papers from a Symposium on Nasals and Nasalization. Stanford: Language Universals Project, 289-316.
1976. A model of speech aerodynamics. Report of the Phonology Laboratory (Berkeley) 1: 93-107.
1978a. Phonological notations as models. In W. U. Dressler and W. Meid (eds.) Proceedings of the 12th International Congress of Linguists. Innsbruck: Innsbrucker Beiträge zur Sprachwissenschaft, 811-816.
1978b. Southern Bantu vs. the world: the case of palatalization of labials. Proceedings of the Annual Meeting of the Berkeley Linguistic Society 4: 370-386.
1979. Universals of labial velars and de Saussure's chess analogy. Proceedings of the 9th International Congress of Phonetic Sciences. Vol. 2. Copenhagen: Institute of Phonetics, 41-47.
1980. The application of phonological universals in speech pathology. In N. J. Lass (ed.) Speech and Language: Advances in Basic Research and Practice. Vol. 3. New York: Academic Press, 75-97.
1981. The listener as a source of sound change. In C. S. Masek, R. A. Hendrick, and M. F. Miller (eds.) Papers from the Parasession on Language and Behavior. Chicago: Chicago Linguistic Society, 178-203.
1983a. The phonological end justifies any means. In S. Hattori and K. Inoue (eds.) Proceedings of the 13th International Congress of Linguists. Tokyo, 232-243. [Distributed by Sanseido Shoten.]
1983b. The origin of sound patterns in vocal tract constraints. In P. F. MacNeilage (ed.) The Production of Speech. New York: Springer Verlag, 189-216.
1985a. Linguistics and automatic speech processing. In R. De Mori and C.-Y. Suen (eds.) New Systems and Architectures for Automatic Speech Recognition and Synthesis. [NATO ASI Series, Series F: Computer and System Sciences, Vol. 16.] Berlin: Springer Verlag, 447-475.
1985b. Around flat. In V. Fromkin (ed.) Phonetic Linguistics: Essays in Honor of Peter Ladefoged. Orlando, FL: Academic Press, 223-241.
1986. Consumer's guide to evidence in phonology. Phonology Yearbook 3: 3-26.
1987. Explanation, evidence, and experiment in phonology. In W. U. Dressler, H. C. Luschützky, O. E. Pfeiffer, and J. R. Rennison (eds.) Phonologica 1984: Proceedings of the 5th International Phonology Meeting, Eisenstadt, Austria. Cambridge: Cambridge University Press, 215-225.
Ohala, J. J. and H. Kawasaki. 1984. Prosodic phonology and phonetics. Phonology Yearbook 1: 113-127.
Ohala, J. J. and J. Lorentz. 1977. The story of [w]: an exercise in the phonetic explanation for sound patterns. Proceedings of the Annual Meeting of the Berkeley Linguistics Society 3: 577-599.
Ohman, S. E. G. 1966. Coarticulation in VCV utterances: spectrographic measurements. Journal of the Acoustical Society of America 39: 151-168.
Osthoff, H. and K. Brugmann. 1967. Preface to Morphological Investigations in the Sphere of the Indo-European Languages I. In W. P. Lehmann (ed. and trans.) A Reader in Nineteenth Century Historical Indo-European Linguistics. Bloomington: Indiana University Press, 197-209. [Originally published as Preface, Morphologische Untersuchungen auf dem Gebiete der indogermanischen Sprachen I. Leipzig: S. Hirzel, 1878, iii-xx.]
Phelps, J. 1937. Indo-European si. Language 13: 279-284.
Repp, B. H. 1976. Perception of implosive transitions in VCV utterances. Status Report on Speech Research (Haskins Laboratories) 48: 209-233.
1977a. Perceptual integration and selective attention in speech perception: further experiments on intervocalic stop consonants. Status Report on Speech Research (Haskins Laboratories) 49: 37-69.
1977b. Perceptual integration and differentiation of spectral information across intervocalic stop closure intervals. Status Report on Speech Research (Haskins Laboratories) 51/52: 131-138.
1978. Perceptual integration and differentiation of spectral cues for intervocalic stop consonants. Perception & Psychophysics 24: 471-485.
Rothenberg, M. 1968. The breath-stream dynamics of simple-released-plosive production. Bibliotheca Phonetica 6.
Schouten, M. E. H. and L. C. W. Pols. 1983. Perception of plosive consonants. In M. van den Broecke, V. van Heuven, and W. Zonneveld (eds.) Sound Structures: Studies for Antonie Cohen. Dordrecht: Foris, 227-243.
Stevens, K. N. 1971. Airflow and turbulence noise for fricative and stop consonants: static considerations. Journal of the Acoustical Society of America 50: 1180-1192.
1972. The quantal nature of speech: evidence from articulatory-acoustic data. In E. E. David and P. B. Denes (eds.) Human Communication: A Unified View. New York: McGraw Hill, 51-66.
Streeter, L. A. and G. N. Nigro. 1979. The role of medial consonant transitions in word perception. Journal of the Acoustical Society of America 65: 1533-1541.
Thurneysen, R. 1961. A Grammar of Old Irish. Dublin: The Dublin Institute for Advanced Studies.
Wang, W. S.-Y. 1959. Transition and release as perceptual cues for final plosives. Journal of Speech and Hearing Research 2: 66-73. [Reprinted in: I. Lehiste (ed.). 1967. Readings in Acoustic Phonetics. Cambridge, MA: MIT Press, 343-350.]
Westbury, J. R. 1979. Aspects of the temporal control of voicing in consonant clusters in English. Ph.D. dissertation, University of Texas, Austin.
Weymouth, R. F. 1856. On the liquids, especially in relation to certain mutes. Transactions of the Philological Society [London], 18-32.
Winitz, H., M. E. Scheib, and J. A. Reeds. 1972. Identification of stops and vowels for the burst portion of /p,t,k/ isolated from conversational speech. Journal of the Acoustical Society of America 51: 1309-1317.
Wright, J. T. 1986. The behavior of nasalized vowels in the perceptual vowel space. In J. J. Ohala and J. J. Jaeger (eds.) Experimental Phonology. Orlando, FL: Academic Press, 45-67.
Zipf, G. K. 1935. The Psycho-biology of Language: An Introduction to Dynamic Philology. Boston: Houghton Mifflin.
15 On the value of reductionism and formal explicitness in phonological models: comments on Ohala's paper JANET B. PIERREHUMBERT
In this paper, Ohala provides a nice case study in the style that he has become so well known for. He presents experimental data indicating that a regular phonological process, the direction of assimilation, is grounded in facts about speech production and perception. Such results are significant not only because of the light they shed on particular phenomena, but also as examples of research methodology. They point out the importance of seeking the right sphere of explanation for observed patterns in sound structure. The type of explanation which is featured in this paper is phonetic explanation, or explanation based on the physics and physiology of speech. Phonetic explanations are especially attractive because of their reductionist character; it is very satisfying to reduce psychology to biology, and biology to physics. Ohala's comparison of phonetic and nonlinear phonological accounts of assimilation links reductionism ("None of the terms of the explanation are unfamiliar, other-worldly entities") with generality (" A few primitives go a long way"). It is not clear to me that this link is well-founded, especially with respect to ongoing research. Nonlinear phonology has identified a number of principles which have great generality, although their physical basis is unclear. In particular, the principle of hierarchical organization has been shown to be a factor in the lexical inventory, phrasal intonation, and allophony rules of many languages. On the other hand, some parts of phonetics are extremely particular, from a scientific point of view. For example, there is no reason to suppose that the specific nonlinear oscillator responsible for vocal fold vibration has any generality from the point of view of physics. At this point, physics has no general theory of nonlinear oscillators; each new system that has been studied has given rise to new analysis techniques and interpretations. Vocal fold vibration is more interesting than other systems of similar mathematical complexity chiefly because of its role in human language. A researcher who is trying to decide how to use his time often has a choice of whether to aim for reduction or for generality. In some cases, paradoxically, aiming for generality is best even if reduction is the aim; making the description more 276
comprehensive and exact can narrow the class of possible underlying mechanisms. The field advances best if the judgement is made on a case-by-case basis, by assessing the feasibility and informativeness of the various methods that might be applied. I think we need to consider, also, the possibility that higher-level domainspecific theories may incorporate scientific insights which are lost (to the human mind, at least) in the theories which "explain" them. I am reminded of a talk I heard a few years ago in which Julian Schwinger discussed his experiences designing microwave guides during the war. He at first viewed this assignment as a trivial and uninteresting one, since the physics of microwave guides is completely specified by Maxwell's equations. However, although the behavior of microwave guides is indeed a solution to Maxwell's equations, their phenomenology proved to be so complex that an additional higher-level theory was needed to make it comprehensible. The interest of the assignment emerged in constructing this theory. In this case, the higher-level theory was desirable even though the explanation was already known. Such theories are doubly desirable when the explanation is being sought. In his summary comparison of phonetic and nonlinear phonological representations of sound patterns, Ohala says that neither claims to represent sound structure in a computationally efficient manner. However, computational models, whether efficient or not have been very important in our progress towards explaining speech. So a brief review may be worthwhile. On the phonetic side, the acoustic theory of speech production relied on calculations made using the first available digital computers. It demonstrated an approximation to speech production which can be computed efficiently, something recently emphasized in Fant (1985). This enabled it to support work on synthesis models, which have been so important to our understanding of speech perception, prosody, and allophony in continuous speech. It has also provided the basis for rigorous work on the limitations of the theory (cf. Fant, Lin, and Gobi 1985). On the other hand, the formalism for phonological rules developed in Chomsky and Halle (1968) was grounded in earlier work by Chomsky and others on the theory of computation. The Sound Pattern of English (SPE) showed considerable descriptive flair, but so did much earlier work. It advanced over earlier work chiefly by its algorithmic approach, and the SPE formalism was successfully applied in implementing the phonological rules of text-to-speech systems, both for English and for other languages (see Allen, Hunnicutt and Klatt 1987; Carlson and Granstrom 1976; Carlson and Granstrom 1986, and references given there). Work in metrical and autosegmental phonology has built on the theory of trees and connected graphs, which are also computationally tractable. This tractability has made it possible to move quickly to implementations which incorporate theoretical advances in nonlinear phonology, for example Church's (1982) syllable parser and work by Pierrehumbert (1979), Anderson, Pierrehumbert and Liberman (1984) 277
and Beckman and Pierrehumbert (1986) on synthesis of fundamental frequency contours. Such implementations of nonlinear representations have themselves led both to new observations and to theoretical innovations (see Pierrehumbert and Beckman 1988). Nonlinear phonology still presents one serious obstacle to computational implementations; it is not explicit about the interaction between derivational rules and well-formedness conditions. Compared to The Sound Pattern of English, the theory remains under formulated. This has been an obstacle to theoretical progress, too, since it has led to confusion about what claims are being made and what their consequences are for new data. One lesson which emerges from reviewing computational work on sound structure is the value of formalization. Unlike Ohala, I feel that formalization pays off for all types of descriptions. Formalizing nonlinear phonological descriptions pays off because it clarifies the issues and assists systematic evaluation, in part by supporting the construction of computer programs. Some descriptions may be just as good as others, but not all are equally good; some are just plain wrong. There is no point in seeking a phonetic or cognitive basis for spurious generalizations. It is important to keep in mind, also, the importance of formalizing phonetic descriptions. If we observe a parallel between some facts about speech and some physical law, we don't have an explanation, we have a conjecture. To have an explanation, it is necessary to write down the equations and determine the quantitative correspondence to the data. Only in this way can we find out if additional physical or cognitive mechanisms are crucially involved. Exact modeling will be especially important for determining the interplay of phonetic and cognitive factors, since the cognitive system can apparently exaggerate and extend tendencies which arise in the phonetics. A second lesson is the value of distinguishing levels of representation. For instance, one level in a synthesis system will have the job of specifying what linguistic contrasts in sound structure are possible. Another will specify how sounds are pronounced, in terms of the time course of acoustic or articulatory parameters. Yet another will specify a speech waveform. For the most part, different kinds of work are done at different levels, and so they are complementary rather than competing. However, competition does arise when the division of labor between levels is unclear. This is how most of the competition between nonlinear phonology and phonetics arises, in my opinion. For example, before Poser (1984), it was unclear whether Japanese had a phonological rule changing High to Mid after a pitch accent, or whether it had a phonetic rule reducing the pitch range after a pitch accent. Poser's experiments resolved this question. I feel optimistic that such issues will in general be resolvable by empirical investigation, and will not become mired in debates about philosophy and taste. A third observation, suggested especially by work on speech synthesis, is that different degrees of explanation are possible at all levels of representation. Nonlinear phonology explains the sound patterns in the English lexicon better 278
than a word list does. This nonlinear explanation in turn requires further explanation; since (as Ohala points out) it is based on internal reconstruction, its cognitive basis is problematic and needs to be worked out. A similar gradation can be found within phonetics proper. The acoustic theory of speech production explains spectra of speech sounds by deriving them from the configurations of the articulators. But it too requires further explanation. Why does the linear approximation work as well as it does? Which articulatory configurations are possible in general? And why does one configuration rather than another occur in any particular case?

References

Allen, J., M. Sharon Hunnicutt, and D. Klatt. 1987. From Text to Speech: The MITalk System. Cambridge Studies in Speech Science and Communication. Cambridge: Cambridge University Press.
Anderson, M., J. Pierrehumbert, and M. Y. Liberman. 1984. Synthesis by rule of English intonation patterns. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing I: 2.8.1-2.8.4.
Beckman, M. and J. Pierrehumbert. 1986. Japanese prosodic phrasing and intonation synthesis. Proceedings of the 24th Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 136-144.
Carlson, R. and B. Granström. 1976. A text-to-speech system based entirely on rules. Conference Record, 1976 IEEE International Conference on Acoustics, Speech, and Signal Processing, 686-688.
Carlson, R. and B. Granström. 1986. Linguistic processing in the KTH multi-lingual text-to-speech system. Proceedings of the 1986 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2403-2406.
Chomsky, N. and M. Halle. 1968. The Sound Pattern of English. New York: Harper and Row.
Church, K. 1982. Phrase-structure parsing: a method for taking advantage of allophonic constraints. Ph.D. dissertation, MIT.
Fant, G. 1960. Acoustic Theory of Speech Production. The Hague: Mouton (2nd edition 1970).
1985. The vocal tract in your pocket calculator. In V. Fromkin (ed.) Phonetic Linguistics. London: Academic Press, 55-77.
Fant, G., Q. Lin, and C. Gobl. 1985. Notes on glottal flow interaction. STL-QPSR 2-3/85, Department of Speech Communication and Music Acoustics, Royal Institute of Technology, Stockholm, 21-41.
Hertz, S. 1982. From text to speech with SRS. Journal of the Acoustical Society of America 72: 1155-1170.
Pierrehumbert, J. 1979. Intonation synthesis based on metrical grids. Speech Communication Papers Presented at the 97th Meeting of the Acoustical Society of America, Acoustical Society of America, New York, 523-526.
Pierrehumbert, J. and M. Beckman. 1988. Japanese Tone Structure. Linguistic Inquiry Monograph Series. Cambridge, MA: MIT Press.
Poser, W. 1984. The phonetics and phonology of tone and intonation in Japanese. Ph.D. dissertation, MIT.
16 A response to Pierrehumbert's commentary JOHN J. OHALA
I thank both Pat Keating and Janet Pierrehumbert for their thoughtful and constructive oral comments on my paper during the conference. I especially thank Janet Pierrehumbert for offering the above written commentary.

Pierrehumbert raises the issue of the connection between generality and reductionism ("A researcher... has a choice whether to aim for reduction or for generality"). The connection or lack of it is, I think, a very simple matter. In explanations the link is obligatory. To explain the unfamiliar by reducing it to the familiar means to bring the unknown into the fold of the known and therefore to enlarge the domain to which the explicanda apply, thus achieving greater generality. An example is Watson and Crick's explanation of the genetic code and the mechanism of inheritance by reducing it to previously known chemical facts, e.g. how adenine bonds only with thymine and cytosine only with guanine in such a way as to guarantee the construction of exact copies of molecular chains (DNA) consisting of those substances. Needless to say, in cases like this it may take genius and inspiration to figure out which facts to bring together into an explanation. One can achieve generality without reduction, but then this is a form of description. Explanation is deductive generalization; a systematic description of a sufficiently wide range of phenomena is inductive generalization.

One of the points in my paper was that nonlinear phonology is a good description of certain sound patterns. Pierrehumbert seems to agree with me, then. Presumably we also agree that although at any given stage of the development of a scientific discipline, inductively-based generalizations are important and necessary, ultimately all disciplines strive to deduce their facts from first principles. Where we may disagree - and here we get "mired in debates about... taste" - is whether our discipline is at a stage where deduction of natural phonological processes is possible. I say that it is and have offered table 14.1 (and the accompanying references) in support of this. Everyone should follow their own hunches on this matter; I only hope that their choice will be based on a thorough understanding of what the phonetic explanations have to offer.
I am not sure I see the relevance of Pierrehumbert's example of the larynx as a nonlinear oscillator. Primitives - the things to which we try to reduce more complex behavior - are, so to speak, the building blocks of the universe. Within physics the larynx as an oscillator can be understood or constructed for the most part out of basic physical building blocks - this, at least, is what vocal-cord modelers such as Flanagan, Ishizaka, and Titze have done. Pierrehumbert's point that "there is no reason to suppose that the specific nonlinear oscillator responsible for vocal fold vibration has any general application in physics" may be true but uncontroversially so because as far as I know, no one has claimed otherwise. The claim - my claim - is the reverse: principles of physics apply to certain aspects of speech and language, including the behavior of the vocal cords. As for the microwave guide example, it is instructive but I do not think it helps to resolve the present point of contention. Although domain-specific investigations are invariably necessary for any specific task (in vocal cord modeling one must produce or assume a value for the elasticity of the vocal cords in order to make the rest of the model work), it can still be a matter of judgement or of contention which features of a model call for domain-specific treatment and which can be reduced to primitives of a more general sort. My position is that, as opposed to current practice, more of the problems that occupy phonologists today can and should be reduced to primitives from, say, physics, physiology, or psychology. To avoid misunderstanding let me say that I do not advocate reduction of phonological phenomena unless (a) the opportunity to do so presents itself (i.e. someone has a bright idea which makes the reduction possible) and (b) this reduction results in an increase in our understanding of the behavior in question. It is possible to give some account of articulator movements in terms of muscle contractions which in turn can be accounted for to some degree in terms of physical and chemical processes in the neuromotor system. But this may be neither necessary nor helpful to our understanding of how certain articulations are made. More to the point, there will always be gaps in our knowledge which prevent our understanding of some things. In other words, reduction will not be an option until someone is inspired to propose and test a hypothesis which specifies candidate primitives underlying the objects of our curiosity. I believe in opportunistic reductionism not obligatory reductionism. (See also Ohala 1986, 1987a, 1987b.) Pierrehumbert's review of computational models in linguistics is useful though it seems to have little to do with my paper. I agree with most of her remarks. The purpose of my table 2 was simply to suggest that there are a variety of criteria by which we might want to evaluate phonological models, including, perhaps, computational efficiency. It is doubtful that one model would be optimal for all tasks but whatever the task we should seek out the best methods to accomplish it. It would not contradict my claim that existing phonetic models do better than nonlinear notations at the task of accounting for natural sound patterns if it turned 281
out that nonlinear phonology was better at the task of computational efficiency. In any case, Pierrehumbert does not seem to insist on this latter point.

References

Ohala, J. J. 1986. Consumer's guide to evidence in phonology. Phonology Yearbook 3: 3-26.
1987a. Experimental phonology. Proceedings of the Annual Meeting of the Berkeley Linguistics Society 13: 207-222.
1987b. Explanation, evidence, and experiment in phonology. In W. U. Dressler, H. C. Luschützky, O. E. Pfeiffer, and J. R. Rennison (eds.) Phonologica 1984. Cambridge: Cambridge University Press, 215-225.
17 The role of the sonority cycle in core syllabification G. N. CLEMENTS
17.1 Introduction
One of the major concerns of laboratory phonology is that of determining the nature of the transition between discrete phonological structure (conventionally, "phonology") and its expression in terms of nondiscrete physical or psychoacoustic parameters (conventionally, "phonetics"). A considerable amount of research has been devoted to determining where this transition lies, and to what extent the rule types and representational systems needed to characterize the two levels may differ (see Keating 1985 for an overview). For instance, it is an empirical question to what extent the assignment of phonetic parameters to strings of segments (phonemes, tones, etc.) depends upon increasingly rich representational structures of the sort provided by autosegmental and metrical phonology, or upon real-time realization rules - or indeed upon some combination of the two, as many are coming to believe. We are only beginning to assess the types of evidence that can decide questions of this sort, and a complete and fully adequate theory of the phonetics/phonology interface remains to be worked out. A new synthesis of the methodology of phonology and phonetics, integrating results from the physical, biological and cognitive sciences, is required if we are to make significant progress in this area. The present study examines one question of traditional interest to both phoneticians and phonologists, with roots that go deep into modern linguistic theory. Many linguists have noted the existence of cross-linguistic preferences for certain types of syllable structures and syllable contacts. These have been the subject of descriptive studies and surveys such as that of Greenberg (1978), which have brought to light a number of generalizations suggesting that certain syllable types are less complex or less marked than others across languages. We must accordingly ask how, and at what level these tendencies are expressed and explained within a theory of language.1 From the late nineteenth century onwards, linguists have proposed to treat generalizations of this sort in terms of a Sonority Sequencing Principle governing 283
the preferred order of segments within the syllable. According to this principle, segments can be ranked along a "sonority scale" in such a way that segments ranking higher in sonority stand closer to the center of the syllable and segments ranking lower in sonority stand closer to the margin. While this principle has exceptions and raises questions of interpretation, it expresses a strong crosslinguistic tendency, and represents one of the highest-order explanatory principles of modern phonological theory. A theory incorporating such a principle must give an adequate account of what "sonority" is, and how it defines the shape of the optimal or most-preferred syllable type. Up to now there has been little agreement on these questions, and phoneticians and phonologists have characteristically taken different approaches to answering them. Phoneticians have generally elected to focus their attention on the search for physical or perceptual definitions of sonority, while phonologists have looked for formal explanations, sometimes claiming that sonority has little if any basis in physical reality. It seems appropriate to reconsider these questions at this time, especially in view of the advances that have been made elsewhere in understanding syllable structure and its consequences for the level of phonetic realization. My purpose here will be to examine the status of sonority within phonological theory. I will propose that an adequate account of sonority must be based on a principle termed the Sonority Cycle, according to which the sonority profile of the preferred syllable type rises maximally at the beginning and drops minimally at the end (the term cycle will be used exclusively in this study to refer to this quasiperiodic rise and fall). We will see that this principle is capable of providing a uniform explanation not only for cross-linguistic generalizations of segment sequencing of the sort mentioned above, but also for an impressive number of additional observations which have not been related to each other up to the present time. Regarding its substantive nature, I will suggest that sonority is not a single, multivalued property of segments, but is derived from more basic binary categories, identical to the major class features of standard phonological theory (Chomsky and Halle 1968) supplemented with the feature "approximant."
17.2 The Sonority Sequencing Principle: a historical overview

The notion that speech sounds can be ranked in terms of relative stricture or sonority can be found in work as early as that of Whitney (1865). However, the first comprehensive attempts to use such a ranking to explain recurrent patterns of syllable structure are due to Sievers (1881), Jespersen (1904), Saussure (1914), and Grammont (1933). Sievers observed that certain syllable types were commonly found in languages, while others differing from them only in the order of their elements were rare or nonexistent. For example, he noted that mla, mra, alm, arm were relatively
frequent in languages, while lma, rma, aml, amr were not. On this basis he assigned the liquids a higher degree of sonority than the nasals. Proceeding in this fashion, Sievers arrived at a ranking of speech sounds in terms of their inherent sonority. In a syllable consisting of several sounds, the one with the greatest sonority is termed the peak, or sonant, and the others the marginal members, or consonants. According to Sievers' sonority principle, the nearer a consonant stands to the sonant, the greater must its sonority be.2

In Jespersen's version of the theory, which is the more familiar today, the sonority principle was stated as follows: "In jeder Lautgruppe gibt es ebensoviele Silben als es deutliche relative Höhepunkte in der Schallfülle gibt" ("In every group of sounds there are just as many syllables as there are clear relative peaks of sonority") (p. 188). Jespersen's version of the sonority scale is given in (1):3

(1)  1. (a) voiceless stops, (b) voiceless fricatives
     2. voiced stops
     3. voiced fricatives
     4. (a) voiced nasals, (b) voiced laterals
     5. voiced r-sounds
     6. voiced high vowels
     7. voiced mid vowels
     8. voiced low vowels
Drawing on the work of Sievers and Jespersen, we may state a provisional version of the Sonority Sequencing Principle as follows:4

(2)  Sonority Sequencing Principle:
     Between any member of a syllable and the syllable peak, only sounds of higher sonority rank are permitted.
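As a concrete illustration, the short sketch below (in Python; the function name, the rank table, and the orthographic stand-ins are my own, not part of the original text) checks candidate syllables against (2) using Jespersen's ranks in (1). It is implemented as a strict rise in sonority toward the peak and a strict fall after it, which is what the examples discussed immediately below require.

```python
# Illustrative only: a handful of segments keyed to Jespersen's scale in (1).
JESPERSEN_RANK = {
    "t": 1, "s": 1,   # voiceless stops and fricatives
    "d": 2,           # voiced stops
    "v": 3,           # voiced fricatives
    "m": 4, "l": 4,   # voiced nasals and laterals
    "r": 5,           # voiced r-sounds
    "a": 8,           # voiced low vowels
}

def obeys_ssp(syllable, peak_index):
    """(2): between any member and the peak, only sounds of higher rank occur,
    checked here as a strict rise up to the peak and a strict fall after it."""
    ranks = [JESPERSEN_RANK[seg] for seg in syllable]
    rising = all(ranks[i] < ranks[i + 1] for i in range(peak_index))
    falling = all(ranks[i] > ranks[i + 1] for i in range(peak_index, len(ranks) - 1))
    return rising and falling

for syl in ["tra", "dva", "sma", "mra", "rta", "vda", "msa", "mla"]:
    print(syl, obeys_ssp(syl, peak_index=2))
# tra, dva, sma, mra -> True; rta, vda, msa, mla -> False
```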
Under this principle, given the sonority scale in (1), syllables of the type tra, dva, sma, mra are permitted, while syllables like rta, vda, msa, mla are excluded. Crosslinguistic comparison supports the view that clusters conforming to the Sonority Sequencing Principle are the most commonly occurring, and are often the only cluster types permitted in a given language. Clusters violating this principle do occur, as we will see, but they are relatively infrequent, and usually occur only in addition to clusters conforming to it. The theory of sonority just characterized was first developed in the late nineteenth century, when the notion of a synchronic grammar as a system of categories and representations defined at various degrees of abstraction from the physical data was yet to emerge in a clear way. The present revival of the sonority principle in phonological theory has taken place in a very different context, that of the initial period of response to Chomsky and Halle's Sound Pattern of English 285
(1968). Chomsky and Halle proposed, among other things, a major revision of the distinctive feature system of Jakobson, Fant and Halle (1952) which retained its binary character but reorganized the way sounds were classified by features. One aspect of the new system was a characterization of the traditional notion degree of stricture5 in terms of a set of binary major class features. These features - first identified as [sonorant, consonantal, vocalic], with [vocalic] later replaced by [syllabic] - were grouped together on the basis of their similar function in accounting for the basic alternation of opening and closing gestures in speech (Chomsky and Halle 1968: 301-302). In terms of their function within the overall feature system, these features played a role analogous to that of sonority in prestructuralist phonology. By excluding any additional feature of sonority from their system, Chomsky and Halle made the implicit claim that such a feature was unnecessary. The notion of scalar or multivalued features was first introduced into generative phonology by Foley (1970, 1972) as an alternative to binary feature systems. Foley's approach was intended as a radical alternative to the approach of Jakobson, Halle, and Chomsky. His main proposals were that (i) all binary features should be replaced by a set of scalar features, and that (ii) these scales do not refer to phonetic properties of segments, but are justified only by recurrent cross-linguistic aspects of segment behavior, as evidenced particularly in sound change. Foley's scalar feature of resonance (1972) is given in (3): (3)
     1. oral stops
     2. fricatives
     3. nasals
     4. liquids
     5. glides
     6. vowels
Through its influence on Zwicky (1972), Hankamer and Aissen (1974), and Hooper (1976), all of whom cite Foley's work, this view of resonance gained wide currency and in its later adaptations came to have a substantial influence on the subsequent development of syllable theory within generative phonology. More recently, the Sonority Sequencing Principle in something close to its original version has had a general revival in the context of syllable phonology (major references include Hooper 1976; Kiparsky 1979; Steriade 1982; and Selkirk 1984). In a significant further development, Hooper (1976) proposed a principle according to which the sonority of a syllable-final consonant must exceed that of a following syllable-initial consonant (equivalently, the second must exceed the first in "strength"). This principle, originally proposed for Spanish, has been found to hold in other languages, though usually as a tendency rather than an exceptionless law (Devine and Stephens 1977; Christdas 1988), and has come to 286
be known as the Syllable Contact Law (Murray and Vennemann 1983). We may state it as follows, using $ to designate a syllable boundary: (4)
The Syllable Contact Law: In any sequence Ca $ Cb there is a preference for Ca to exceed Cb in sonority.
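A similarly minimal sketch of (4), again an illustrative assumption rather than the paper's formalism, simply compares the sonority ranks on either side of a syllable boundary, using the O < N < L < G ordering introduced in section 17.4:

```python
# Toy evaluation of the Syllable Contact Law in (4): Ca $ Cb is preferred when
# Ca exceeds Cb in sonority.  Ranks follow O < N < L < G (obstruent to glide).
RANK = {"O": 0, "N": 1, "L": 2, "G": 3}

def preferred_contact(ca, cb):
    """True iff the coda consonant Ca outranks the following onset Cb."""
    return RANK[ca] > RANK[cb]

print(preferred_contact("L", "O"))   # e.g. al$ta: True, the preferred contact
print(preferred_contact("O", "L"))   # e.g. at$la: False, the dispreferred contact
```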
17.3 Current issues in sonority theory

In spite of its importance, sonority remains an ill-defined, if not mysterious concept in many respects, hence the urgency of reexamining its theoretical and empirical bases.6 Among the questions we would like to be able to answer are the following: how, exactly, is sonority defined in phonological theory? Is it a primitive feature, or is it defined in terms of other features? What are its phonetic properties? Assuming that some version of the Sonority Sequencing Principle is correct, at what linguistic level does it hold? Over what morphological or prosodic domain are sonority constraints most appropriately defined? Can we define a single sonority scale valid for all languages, or must we recognize a significant degree of cross-linguistic variation? These are some of the questions that will be addressed in the remainder of this study.
17.3.1 At what level does the Sonority Sequencing Principle hold?
One important issue concerns the level at which the SSP holds. A surface-oriented version of the SSP might claim that it holds without exception of surface syllabification in all languages. In such a view, the SSP would project a unique and exhaustive syllabification over any arbitrary string of phonemes in surface representation, containing no prior annotations showing where the syllable peaks lie. In agreement with much of the more recent literature, I will suggest that such an interpretation of the SSP is incorrect, and that the SSP holds at a more abstract level than surface representation.7 Consider the representative examples in (5). The cases in (5a) represent sonority "plateaus," in which two adjacent consonants at the beginning or end of a word have the same sonority rank.8 In (5b) we find representative cases of sonority "reversals," in which the sonority profile first rises, then drops again as we proceed from the edge of the word inward. As these examples show, reversals can be cited involving all major segment classes: fricatives, liquids, nasals and glides. In (5c) we find cases in which the syllable peaks are not sonority peaks (at least not prior to the assignment of syllabicity), but are adjacent to elements of higher sonority. For example, the syllabic peak of English yearn is the liquid [r], which is adjacent to the glide [y] and thus does not constitute a sonority peak. Contrasting examples such as pedaller, pedlar show that syllable peaks are not fully 287
predictable in all languages on the basis of the surface context. The level of representation assumed here is approximately that of systematic phonetic representation prior to the application of (automatic or language-particular) phonetic realization rules, though as transcription practices vary from one writer to another and are often inexplicit, this assumption may not accurately reflect each writer's intention in all cases.9 (5)
a. Consonant sequences with sonority plateaus:
   English: apt, act, sphere
   Russian: mnu 'I crumple', tkut 'they weave', kto 'who', gd'e 'where'
   Mohawk: tkataweya't 'I enter', kka:wes 'I paddle'
   Marshallese: qqin 'to be extinguished', kksn 'to be invented', lliw 'angry'
b. Consonant sequences with sonority reversals:
   English: spy, sty, sky, axe, apse, adze
   German: Spiel 'game', Stein 'stone', Obst 'fruit', letzt 'last'
   Russian: rta 'mouth (gen.)', lba 'forehead (gen.)', mgla 'mist'
   French: table [tabl] 'table', autre [otr] 'other'
   Cambodian: psa: 'market', spiy 'cabbage'
   Pashto: wro 'slowly', wlar 'he went', lmar 'sun'
   Ewe: yra 'to bless', wlu 'dig'
   Klamath: msas 'prairie dog', ltewa 'eats tules', toq'lGa 'stops'
   Mohawk: kskoharya'ks 'I cut dead wood'
   Ladakhi: lpaks 'skin', rtiŋ-pa 'heel', rgy9l9 'road'
   Kota: anzrcgcgvdk 'because will cause to frighten'
   Abaza: yg'yzdmlratxd 'they couldn't make him give it back to her'
   Tocharian A: ynes 'apparent', ysar 'blood'
   Yatee Zapotec: wbey 'hoe', wse-zi-le 'morning', wza-'a 'I ran'
c. Syllables whose peaks are not sonority peaks:
   English: yearn [yrn], radio, pedaller [ped-l-r] vs. pedlar [ped-lr]
   German: können [kon-n] 'to be able', wollen [vol-n] 'to be willing' vs. Köln [koln] 'Cologne'
   French: rel(e)ver [ra-j-ve] 'to enhance', troua [tru-a] '(s)he dug a hole' vs. trois [trwa] 'three', haï [a-i] 'hated' vs. ail [ay] 'garlic'
   Spanish: país [pa-is] 'country', río [ri-o] 'river', bahía [ba-i-a] 'bay'; fuimos [fwi-mos] 'we went' vs. huimos [u-i-mos] 'we fled'; piara [pya-ra] 'herd' vs. piara [pi-a-ra] 'chirp' (past subj.)
   PIE: *wlkos 'wolf' (cited by Saussure as evidence against the sonority theory)
   Turkish: dağa [da-a] 'mountain (dat.)' vs. dağ [da:] 'mountain (nom.)'
   Berber: ti-wn-tas 'you climbed on him', ra-ymm-yi 'he will grow', ra-tk-ti 'she will remember'
   Swahili: wa-li-m-pa 'they gave him', wa-i-te 'call them', ku-a 'to grow' vs. kwa 'of, at'
   Luganda: ttabi [t-tabi] 'branch', kkubo [k-kubo] 'path', ddaala [d-da:la] 'step'
   Bella Coola: nmnmak 'both hands', mnmnts 'children', sk'lxlxc 'I'm getting cold'
The facts are not at issue in most of these cases, and in many similar cases that 288
could be cited. Some writers' accounts are detailed enough to allow us to accept the transcriptions with a very high degree of confidence. For example, Jaeger and Van Valin (1982) report that the glide [w] occurs regularly before obstruents in word-initial clusters in Yatee Zapotec (see examples in (5b) above). They argue that [w] cannot be reanalyzed as or derived from the vowel [u] for a number of reasons: (i) [w] has the acoustic properties of a glide, namely short duration and a rapidly moving second formant; (ii) [u] does not occur elsewhere in the language, underlyingly or on the surface; (iii) otherwise there are no vowel-initial words in Yatee Zapotec; (iv) [w] carries no tone, whereas all syllables otherwise have phonemic tone; (v) [w] usually functions as an aspect prefix, and all other aspect prefixes are consonants. As Jaeger and Van Valin point out, this latter fact helps to explain the survival of these rare cluster types, since the [w] carries essential morphological information that would be lost if [w] were deleted. Such problems for surface-oriented versions of the SSP were recognized in the earliest literature. Sievers, for example, noted the existence of anomalous clusters such as those in pta, kta, apt, akt in which the second member was a dental sound. He suggested that they might be due to the "ease of the articulatory transition" to a dental, but did not offer independent reasons for assuming that transitions to dentals were simpler than transitions to other places of articulation. Sievers had similar trouble dealing with the reversed-sonority clusters in syllables like spa, sta, aps, ats, and attempted to explain them by introducing the notion of the "secondary syllable" (Nebensilbe), a unit which counted as a syllable for the purposes of the SSP but not for linguistic rules, such as stress placement and the like. Jespersen proposed special principles to account for type (5c) violations involving rising sonority ramps (as in Spanish rio), but had nothing to say about falling ramps (as in Spanish pats). I will suggest that the SSP holds at deeper levels of representation than surface representation, and in particular that it governs underlying syllabification in the lexical phonology. More exactly, the underlying representations of any language are fully syllabified in accordance with certain principles of core syllabification, which are sensitive to sonority constraints (see further discussion in section 17.5.1). However, some consonants remain unsyllabified (or extrasyllabic) after the core syllabification rules have applied. Such consonants either become syllabified at a later point in the derivation, or are deleted. Such an analysis is strongly supported when we examine the phonological characteristics of sonority violations more closely, since we often find convincing evidence that they involve consonants that are not incorporated into syllable structure in the early stages of phonological derivations. In Sanskrit, for example, such consonants regularly fail to participate in reduplication, a fact which can be explained on the assumption that unsyllabified consonants are invisible to reduplication (Steriade 1982). In Turkish, Klamath, and Mohawk, among other
languages, such consonants regularly trigger the application of a variety of epenthesis rules (Clements and Keyser 1983; Michelson 1988); we can view these rules as having the function of incorporating unsyllabified consonants into syllable structure, and thus of simplifying representations. The fact that extra-syllabic elements tend to be removed from representations by rules of epenthesis is an instance of the more general principle that rules tend to apply in such a way as to replace complex representations with simpler ("euphonic," "eurhythmic") ones. Furthermore, in English and many other languages, violations of the sonority constraints are restricted to the edges of the syllabification domain, reflecting the preference of extrasyllabic elements for this position (Milliken 1988). Thus, for example, the sonority violation found in the final cluster of apt can be explained on the assumption that the [t] is extra-syllabic at the point where the sonority constraints apply; notice that syllable-final sequence [pt] is found only at the end of level 1 stems, which form the domain of core syllabification (Borowsky 1986). Finally, there is evidence that such consonants may remain unsyllabified all the way to the surface, in a number of languages which permit long, arbitrary strings of consonants in surface representation (see examples in (5b) and the references just below). Other, related ways of explaining or eliminating surface exceptions to the SSP have been proposed in the earlier literature. These include the restriction of the principle to the syllable "core" as opposed to "affix" (Fujimura and Lovins 1978), the recognition of a syllable "appendix" that lies outside the scope of sonority restrictions (Halle and Vergnaud 1980), the postulation of languageparticular rules that take precedence over the principle (Kiparsky 1979, 1981), and the treatment of clusters such as English sp, st, sk as single segments rather than clusters for the purposes of syllabification (Selkirk 1982). It is not always possible to find convincing independent evidence for such strategies in all cases, however, and it seems possible that a hard core of irreducible exceptions will remain. Languages exhibiting a high degree of tolerance for long, arbitrary sequences of consonants prominently include (but are not limited to) the Caucasian languages, several Berber dialects, numerous Southeast Asian languages and many languages native to the Northwest Pacific Coast.10 17.3.2
The phonetic basis of sonority
A further issue in sonority theory involves its phonetic basis. Given the remarkable similarity among sonority constraints found in different and widely separated languages, we might expect that sonority could be directly related to one or more invariant physical or psychoacoustic parameters. However, so far there exists no entirely satisfactory proposal of this sort. The problem is due in part to lack of agreement among phonologists as to exactly what the universal sonority hierarchy consists of. As Selkirk points out (1984), it will not be possible to determine the exact phonetic character of sonority until phonologists have come to some 290
agreement about the identity of the hierarchy. But many proposals have been put forward at one time or another, and at present there are a great number of competing sonority scales to claim the attention of phoneticians. This problem should disappear as better theories of sonority are developed. But in addition to uncertainty regarding the linguistic definition of the sonority scale, there is some question whether a uniform, independent phonetic parameter corresponding to sonority can be found, even in principle. This is because the various major classes of speech sounds have substantially different properties from nearly every point of view: aerodynamic, auditory, articulatory, and acoustic. It is true that a variety of phonetic definitions of sonority have been offered in the past, from the early proposal to define it in terms of the relative distance at which sounds could be perceived and/or distinguished (O. Wolf, cited by Jespersen 1904: 187) to a variety of more sophisticated recent suggestions (see for example Lindblom (1983) and Keating (1983) for discussion of an articulatory-based definition of sonority, and Price (1980) for discussion of an acoustic-based definition). But as Ohala and Kawasaki state (1985: 122), "no one has yet come up with any way of measuring sonority" - not, at least, a widely agreed-upon method based on a uniform phonetic parameter corresponding to a linguistically-motivated sonority scale.11

The ultimate response to this problem may be to deny that sonority has any regular or consistent phonetic properties at all (Hankamer and Aissen 1974; Hooper 1976; Foley 1977). While this represents a position to which we might be driven out of necessity, we do not choose it out of preference, since in the absence of a consistent, physical basis for characterizing sonority in language-independent terms we are unable to explain the nearly identical nature of sonority constraints across languages.

We should not be overly concerned about the difficulty of finding well-defined phonetic definitions of sonority, however. In the past, attempts to define linguistic constructs in terms of physical definitions (or operational procedures) have usually proven fruitless, and it is now widely agreed that abstract constructs are justified just to the extent that they are tightly integrated into the logical structure of predictive and explanatory theories. Thus, no adequate phonetic definition has ever been given of the phoneme, or the syllable - and yet these constructs play a central and well-understood role in modern phonology. Similarly, the notion of sonority is justified in terms of its ability to account for cross-linguistic generalizations involving phoneme patterning, and need not have a direct, invariant expression at the level of physical phonetics. The problem raised by our (at present) incomplete physical understanding of sonority reduces to the problem of accounting for why linguistically-motivated sonority rankings are very much the same across languages. I will suggest that the sonority scale is built into phonological theory as part of universal grammar, and that its categories are definable in terms of the independently-motivated categories of feature theory, as discussed in section 17.4.
17.3.3 The redundancy of the feature "sonority"
A further issue concerns the feature characterization of sonority. A multivalued feature of sonority makes good sense as an alternative to the binary major class features of standard theory, but is harder to justify as a supplement to them, since sonority can be adequately defined in terms of these independently-motivated features (see discussion below), and is thus redundant in a theory containing them. One way of eliminating this redundancy is to eliminate the major class features themselves. This approach is considered by Selkirk, who suggests that the work of the major class features can be done by (a) a feature representing the phonetic dimension of sonority, (b) the sonority hierarchy, and (c) the assignment of a sonority index to every segment of the language (1984: 111). Hankamer and Aissen are even more categorical, stating "the major class features of the standard feature system do not exist" (1974: 142). Most proponents of the sonority hierarchy have been reluctant to go this far, but there has been little explicit discussion of how we may justify the presence of two types of features with largely overlapping functions in feature theory.
17.4 The major class features and the definition of sonority

On the basis of this review of the issues, let us consider how sonority can be incorporated into a formal theory of phonology and phonetics. As has previously been noted (Basbøll 1977, Lekach 1979), the sonority scale can be defined in terms of independently-motivated binary features. I will adopt a definition involving the four major class features shown in (6), where O = obstruent, N = nasal, L = liquid, G = glide (the choice of this particular scale will be justified below). The sonority scale for nonsyllabic elements is derived by taking the sum of the plus-specifications for each feature.
(6)
                                O  <  N  <  L  <  G
    "syllabic"                  -     -     -     -
    vocoid                      -     -     -     +
    approximant                 -     -     +     +
    sonorant                    -     +     +     +
    rank (relative sonority)    0     1     2     3
Three of these features are familiar from the earlier literature. "Syllabic" can be interpreted as referring to the prosodic distinction between V and C elements of the timing tier, or alternative characterizations of syllable peaks in prosodic terms. For reasons given in Clements and Keyser (1983: 8-11, 136-7), I will take this feature to have no intrinsic physical definition, but to be defined in language-
particular terms: "syllabic" segments are those which attract the properties of the syllabic nucleus in any particular language, while " nonsyllabic" segments are those which do not. "Vocoid," a term introduced by Pike (1943), is simply the converse of the traditional feature "consonantal" and is defined accordingly. "Sonorant" has its usual interpretation. In order to complete our definition of the sonority scale in terms of binary features we require a further feature, grouping liquids, glides and vowels into one class and nasals and obstruents into another. This is exactly the function of the feature "approximant," proposed by Ladefoged (1982) who defines it as "an articulation in which one articulator is close to another, but without the vocal tract being narrowed to such an extent that a turbulent airstream is produced" (1982: 10). In order to clearly exclude nasals (which do not involve a turbulent airstream), I will consider an approximant to be any sound produced with an oral tract stricture open enough so that airflow through it is turbulent only if it is voiceless.12 The recognition of approximant as a feature is justified by the fact that approximants tend to pattern together in the statement of phonological rules. For example, many languages allow complex syllable onsets only if the second member is an oral sonorant, i.e. an approximant in our terms. Similarly, nonapproximants often pattern together. In Luganda, only nonapproximants occur as geminates: thus we find geminate /pp, bb, ff, vv, mm/, etc., but not /ww, 11, yy/. We will treat approximant as a binary feature, like the other major class features. In Ladefoged's account (see Ladefoged 1982: 10, 38-9, 61-2, 256, 265), approximant is not a feature category, but a value or specification of the feature category stop. This category is a three-valued scalar feature whose values are stop, fricative, approximant. This system makes the prediction that the values stop and approximant are mutually exclusive, and thus do not allow any segment to bear both of these values at once. The crucial data here involve laterals, which in the feature classification given in Halle and Clements (1983) are classified as [ —cont], and in the present account must also be [ +approximant]. Under the view proposed here, then, / I / is both a stop and an approximant, and may function as such in one and the same language. This appears to be correct. For example, in English / I / functions with the other approximants in its ability to occur as the second member of complex syllable onsets: /pi, tr, kw/, etc., while nasals may occur in this position only after / s / . But / I / also patterns with nasals in the rule of intrusive stop formation, in which an intrusive stop is inserted between a nasal or lateral and the following fricative in words like den[t]se, falftjse, hamfpjster, wealftjthy. This rule involves a "lag" of the features [ — cont] and [place] onto the following segment (see Clements 1987a for discussion). The scale in (6) is incomplete in that it only provides for nonsyllabic segments, and not for syllabic segments, including vowels. In some languages, nasals and even obstruents can function as syllable peaks, in certain circumstances (for syllabic obstruents see Bell's 1978 survey article; Dell and Elmedlaoui 1985 for 293
discussion of a dialect of Berber; Clements 1986 for discussion of syllabic geminates in LuGanda; and Rialland 1986 for discussion of syllabic consonants derived through compensatory lengthening in French). In principle, any segment can occupy the syllable peak, but the ability of a given segment to function as a syllable peak is related to its rank on the sonority scale. Our model predicts the following ranking of syllabic segments: (7)
                                O  <  N  <  L  <  V
    syllabic                    +     +     +     +
    vocoid                      -     -     -     +
    approximant                 -     -     +     +
    sonorant                    -     +     +     +
    rank                        1     2     3     4
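As a rough cross-check on the two tables, the following sketch (an illustrative Python encoding of my own, not part of the paper's formalism) derives each rank simply by counting plus-specifications:

```python
# The rank of each class in (6) and (7) is the number of plus-specifications it
# carries for the four major class features, listed here in a fixed order.
FEATURES = ("syllabic", "vocoid", "approximant", "sonorant")

SPECS = {                        # values listed in the order of FEATURES
    "obstruent":          ("-", "-", "-", "-"),
    "nasal":              ("-", "-", "-", "+"),
    "liquid":             ("-", "-", "+", "+"),
    "glide":              ("-", "+", "+", "+"),
    "syllabic obstruent": ("+", "-", "-", "-"),
    "syllabic nasal":     ("+", "-", "-", "+"),
    "syllabic liquid":    ("+", "-", "+", "+"),
    "vowel":              ("+", "+", "+", "+"),
}

def sonority_rank(spec):
    """Relative sonority = sum of plus-specifications for the major class features."""
    return sum(value == "+" for value in spec)

for name, spec in SPECS.items():
    print(f"{name:18s} rank {sonority_rank(spec)}")
# obstruent 0, nasal 1, liquid 2, glide 3; syllabic obstruent 1, syllabic nasal 2,
# syllabic liquid 3, vowel 4 -- matching the ranks in (6) and (7) above.
```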
(Note that a syllabic glide is identical to a vowel, or to put it another way, a glide is simply a nonsyllabic vowel: cf. Pike 1943.) However, this does not quite accord with the facts, since as Bell has noted (1978), syllabic nasals are generally preferred to syllabic liquids in languages that have just one or the other. The notion "relative sonority," as defined in (7), does not therefore extend unproblematically to syllable peaks, which require separate discussion.13 In an alternative proposal, Van Coetsem (1979) has suggested reintroducing the feature "vocalic" alongside "syllabic." As he points out, this would correctly allow us to distinguish nasals and liquids by a major class feature, unlike the SPE feature system which must make use of the feature "nasal." In our proposal (and Ladefoged's), however, this task is accomplished by the major class feature "approximant." The major difference between the two proposals is that if we chose "vocalic" instead of "approximant," glides would be ranked at the same sonority level as liquids by the algorithm given above: (8)
                                O  <  N  <  L  =  G  <  V
    syllabic                    -     -     -     -     +
    vocoid                      -     -     -     +     +
    vocalic                     -     -     +     -     +
    sonorant                    -     +     +     +     +
    rank                        0     1     2     2     4
More importantly, a system with the feature "approximant" seems to reflect natural groupings of sounds better than one with "vocalic." Thus, liquids and glides ([+approximant] nonsyllabics) frequently fall together as a class in the statement of rules, while obstruents, nasals and glides ([-vocalic] nonsyllabics) rarely or never do. One of the primary functions of the feature [vocalic] in earlier feature systems was to designate the natural class of liquids, characterized as [+vocalic, +consonantal] sonorants. However, this function is equally well
served by a feature system containing [approximant], in which liquids are designated by the features [+approximant, -vocoid]. I conclude, therefore, that the major class features are correctly represented as in (6) and (7). Combinations of major class features other than those given in (6) and (7) are non-occurring, and are excluded by the following universal redundancy rules:

(9)
     a. [-sonorant] → [-approximant]
     b. [-approximant] → [-vocoid]

These rules entail the following, by contraposition:

     c. [+approximant] → [+sonorant]
     d. [+vocoid] → [+approximant]
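A minimal sketch of one way (9a-b) might be enforced mechanically follows; the dictionary encoding, and the decision to readjust the consequent features, are assumptions of mine, since the text only says that the dependent values are readjusted "as necessary".

```python
# Apply (9a) [-sonorant] -> [-approximant] and (9b) [-approximant] -> [-vocoid]
# as well-formedness conditions on a single feature specification.
def readjust(spec):
    """spec: dict with '+'/'-' values for 'sonorant', 'approximant', 'vocoid'."""
    fixed = dict(spec)
    if fixed["sonorant"] == "-":
        fixed["approximant"] = "-"      # (9a)
    if fixed["approximant"] == "-":
        fixed["vocoid"] = "-"           # (9b)
    return fixed

# An ill-formed combination (an obstruent marked [+approximant, +vocoid]) is repaired:
print(readjust({"sonorant": "-", "approximant": "+", "vocoid": "+"}))
# -> {'sonorant': '-', 'approximant': '-', 'vocoid': '-'}
```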
I will assume that these redundancy rules apply to the output of each phonological rule as well-formedness conditions, and readjust the values for approximant, vocoid and sonorant as necessary. The four major class features together with the redundancy rules given above define 21 natural classes (or 29, if we count single-member classes). These are conveniently suggested in terms of the following 2 x 4 array of segment types. Terms that can be enclosed in a vertically or horizontally oriented rectangle constitute a natural class. Three examples are given for illustration.

(10)
    [+syllabic]:   O   N   L   V
    [-syllabic]:   O   N   L   G
The three boxes represent the class of syllabic consonants, the class of consonantal sonorants, and the class of nonsyllabic approximants, from upper left to lower right. Notice that a single step to the right or up results in one-degree increase in sonority rank. This array clearly shows the special status of the feature [syllabic] in the sonority scale: it is the only major class feature that crossclassifies all others. This shows that [syllabic] has a different status in feature representation than the true major class features, not functioning as a feature but as a prosodically defined position within the syllable, as suggested above. The sonority scale as given in (6) and (7) does not include a subdivision of obstruents into stops and fricatives, or into voiceless and voiced obstruents; nor 295
does it recognize a distinction between lateral and central liquids. This is because to date (see further discussion below), the best-motivated cross-linguistic generalizations involving sonority, such as Greenberg's, do not appear to require any further subdivision of the sonority scale. However, some individual languages have been proposed as motivating further subdivisions, such as one between voiced fricatives and other obstruents, or one between lateral and central liquids. To accommodate such languages, linguists have proposed to recognize more elaborate versions of the sonority scale including additional features such as [continuant], [voiced], [coronal], etc., and have proposed that the features relevant for the definition of sonority can vary from one language to another. This approach will not be adopted here, however. The explanatory value of sonority theory lies in its ability to predict valid cross-linguistic generalizations. As soon as we allow the sonority scale to vary in its identity from one language to another, we seriously undermine its explanatory role by increasing the number of ways in which it will accommodate potential exceptions, thus reducing the number of cross-linguistic generalizations that it accounts for. I will argue below that there are considerable advantages to maintaining a strong, predictive version of sonority theory. Much of the apparent evidence for language-particular variation in the sonority scale comes from observations which can be explained in other ways. For example, the tendency for voiced fricatives to be excluded as the first member of initial clusters, found in a number of languages including English, can be understood in terms of a principle (to be described in section 5.3) which holds that all else being equal, sequences containing less marked segments are favored over sequences containing more marked segments. Crosslinguistically, voiced fricatives are more marked than either stops or voiceless fricatives. This principle extends to many other distributional regularities that had previously been thought to require language-particular modifications of the sonority scale, such as the apparently greater sonority of coronal as opposed to noncoronal consonants in some languages. The strongest position consistent with the available evidence is that a single scale, perhaps O < N < L < G < V o r a simple variant of this, defines the unmarked order of segments within the syllable across languages, and that apparent deviations from this scale have independent explanations. 17.4.1 An alternative view, sonority as a multivalued feature What has tempted many linguists to consider sonority to be a single, multivalued feature is the fact that it arrays segment classes into a hierarchy. Other binary features in phonology do not seem to have this property. The notion of hierarchy, it can be argued, is most simply and directly expressed in terms of a single multivalued feature, rather than by making use of several binary features constrained by redundancy rules such as those in (9a-d). A multivalued feature of this sort is equally capable of capturing the necessary natural classes, if we allow 296
that rules may refer to any continuous sequence of positions along the hierarchy. For example, given the sonority scale in (6) and (7), the natural class of liquids, glides and vowels can be referred to by the expression "[2-4 sonority]" (cf. Zwicky 1972, Selkirk 1984 for discussion of such a proposal). Let us consider the notion of hierarchy in general terms. Given a binary feature system, a hierarchy is defined whenever we have an implicational relation holding between two features. For example, given the two binary features F, G and the implication [+F] → [+G] (entailing [-G] → [-F] by contraposition), we define a three-term hierarchy over the segment classes A, B, C:

(11)
             A     B     C
    [F]      -     -     +
    [G]      -     +     +
    rank     0     1     2
The number of terms in the hierarchy increases as we add implicational relations. Thus, the four-term hierarchy O < N < L < G results from the presence of the two implicational statements (redundancy rules) (9a-b) given earlier. Hierarchies of this type are common elsewhere in grammar. For example, in some Bantu languages we find the following nominal hierarchy defined by the accessibility of a given nominal to direct object status: 1st person > 2nd person > 3rd person human > 3rd person animal > 3rd person inanimate (Hyman, Duranti, and Morolong 1980). In principle, it would be possible to define this hierarchy in terms of a single multivalued feature whose meaning is roughly "similarity or closeness to ego." However, the various positions on this hierarchy do not form a continuum, but a series of discrete steps, most of which are found to play a role elsewhere in grammar: for example, the distinction between first, second and third person animate commonly plays a role in Bantu inflectional morphology. In such cases linguists do not usually assume that a single, multivalued feature or parameter is at work, but rather that the hierarchical scale is built up out of independently-needed linguistic categories linked by implicational relations. This seems to be the appropriate way to view the sonority hierarchy. There is, moreover, considerable phonetic and perceptual rationale for the definition of the sonority scale given in (6) and (7) in terms of the major class features. "Sonority" is a composite property of speech sounds which depends on the way they are specified for each of a certain set of features. Plus-specifications for any of these features have the effect of increasing the perceptibility or salience of a sound with respect to otherwise similar sounds having a minus-specification, for example by increasing its loudness (a function of intensity), or making its formant structure more prominent. By defining the sonority scale in terms of several independent features rather than attempting to define it in terms of a single, uniform phonetic 297
parameter, we take a significant step toward solving the problem of "defining" sonority in phonetic terms. Moreover, we are able to relate the notion "relative sonority" directly to perceptibility, since each of the acoustic attributes associated with a plus-specification for a major class feature enhances the overall perceptibility of the sounds that it characterizes. (See Stevens and Keyser 1987 for recent discussion of the acoustic correlates of distinctive features in somewhat similar terms.) In sum, we may regard the major class features as defining the relative sonority of the various speech sounds in just this sense. Although the notion of relative sonority cannot be defined in terms of any single, uniform physical or perceptual property, we need not conclude that it is a fictitious or purely subjective matter, as long as we consider it a composite attribute of speech sounds, defined in terms of a set of major class features which themselves have relatively well-defined attributes.
17.5 The sonority cycle Let us now consider the way sonority-based constraints are to be formulated in core phonology. I will propose a model involving two principles, which I will term the principles of Core Syllabification and Feature Dispersion. These two principles, taken together, implement the principle of the Sonority Cycle. 17.5.1 The Core Syllabification Principle I will assume that there is a more or less well-defined portion of the lexical phonology characterized by certain uniform, perseverative properties. For example, in some languages the set of syllabification rules responsible for the syllabification of underlying representations reapplies to the output of each phonological and morphological operation throughout a portion of the lexical phonology; we may call these the core syllabification rules, after Clements and Keyser (1981, 1983). Similarly, by a principle of conservation some languages maintain a uniform phoneme inventory throughout much or all of the lexical phonology, an effect of " structure-preservation " which Kiparsky has proposed to account for in terms of marking conditions (Kiparsky 1985). Furthermore, in some languages, we observe constraints on segment sequences that hold both of nonderived stems and derived stems, giving rise to "conspiracy" effects that cannot be accounted for by syllabification principles alone (see Christdas 1988 for discussion of Tamil). The segmental and sequential uniformity characterizing these inner layers of the lexical phonology does not generally extend to the postlexical phonology, and does not necessarily even characterize the entire lexical phonology where violations of structure preservation (in the strict sense that precludes the introduction of novel segment types) are found in a number of languages (Clements 1987b). I will refer to the portion of the lexical phonology 298
subject to such perseverative well-formedness conditions as the core phonology, and the syllabification rules that operate at this level the core syllabification rules. Let us consider the nature of core syllabification more closely. As has widely been noted, syllables are normally characterized by a rise and fall in sonority which is reflected in the sonority scale values characterizing each of their segments. Sequences of syllables display a quasiperiodic rise and fall in sonority, each repeating portion of which may be termed a sonority cycle. It is possible to fit a curve or outline over such representations which reflects this rise and fall, as shown in (12), consisting of two cycles:

(12)  [Figure: a curve fitted over a two-syllable segment string, rising and falling across the levels syllabic, vocoid, approximant, and sonorant, and tracing two sonority cycles.]
The number of cycles whose peaks fall on the top ([syllabic]) line of this diagram will correspond exactly to the number of syllables, except that a plateau along the top line (representing a sequence of vowels) may be parsed as a sequence of syllable peaks, as in many of the examples of (5c). We may formulate a preliminary version of the Sonority Sequencing Principle in terms of this cyclic organization. It will be stated in terms of three steps or actions which are performed successively on segment strings to create syllables. The first of these searches for [ + syllabic] segments as defined by the language in question, and introduces a syllable node over them (cf. Kahn 1980). This step presupposes that syllabic segments are already present in the representation at this point, whether created by rule or underlying (as is required in the case of languages that have unpredictable distinctions between vowels and glides or other segments differing only in syllabicity, as in French, discussed below). Further segments are syllabified by first adding segments to the left that have successively lower sonority values, and then doing the same for unsyllabified segments on the right. This yields the following principle of unmarked syllabification. I will call it the Core Syllabification Principle (CSP) for reasons that will become clear in the subsequent discussion. (13)
     The Core Syllabification Principle (CSP):
     a. Associate each [+syllabic] segment to a syllable node.
     b. Given P (an unsyllabified segment) preceding Q (a syllabified segment), adjoin P to the syllable containing Q iff P has a lower sonority rank than Q (iterative).
     c. Given Q (a syllabified segment) followed by R (an unsyllabified segment), adjoin R to the syllable containing Q iff R has a lower sonority rank than Q (iterative).
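Read procedurally, (13) can be sketched as follows. This is an illustrative Python rendering under the O < N < L < G < V ranks of (6)-(7); the data structures and names are my own assumptions, and the sonority-free first leftward adjunction anticipates the qualification about CV syllables discussed in the next paragraph.

```python
# Ranks follow O < N < L < G < V (obstruent, nasal, liquid, glide, vowel).
RANK = {"O": 0, "N": 1, "L": 2, "G": 3, "V": 4}

def core_syllabify(classes, syllabic):
    """classes: list of class labels; syllabic: set of indices of [+syllabic] segments.
    Returns (syllables, extrasyllabic), with syllables as lists of segment indices."""
    # (13a): associate each [+syllabic] segment with a syllable node
    syllables = [[i] for i in sorted(syllabic)]
    assigned = set(syllabic)
    # (13b): adjoin unsyllabified segments to the left, iteratively, as long as each
    # newly adjoined segment is lower in sonority than the one it precedes
    # (the first, CV-creating adjunction is left unrestricted)
    for syl in syllables:
        j, first = syl[0] - 1, True
        while j >= 0 and j not in assigned and (
                first or RANK[classes[j]] < RANK[classes[j + 1]]):
            syl.insert(0, j)
            assigned.add(j)
            first = False
            j -= 1
    # (13c): then adjoin remaining segments to the right under the same condition
    for syl in syllables:
        j = syl[-1] + 1
        while j < len(classes) and j not in assigned and (
                RANK[classes[j]] < RANK[classes[j - 1]]):
            syl.append(j)
            assigned.add(j)
            j += 1
    extrasyllabic = [i for i in range(len(classes)) if i not in assigned]
    return syllables, extrasyllabic

# 'template' [templeyt] as the class string O V N O L V G O, vowels at indices 1 and 5:
print(core_syllabify(list("OVNOLVGO"), {1, 5}))
# -> ([[0, 1, 2], [3, 4, 5, 6, 7]], [])
```

Run on the segment classes of template with the two vowels marked [+syllabic], the sketch returns the division tem.pleyt shown in (14) below.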
The first iteration of (13b), which creates CV syllables, is not restricted by the sonority condition, since languages allowing syllabic consonants may permit segments of equal or higher sonority to be syllabified to their left: cf. English yearn, from underlying /yrn/ in which / r / is syllabic, and similar examples in other languages (see Steriade 1982 for further observations on the special status of CV syllables). A further necessary qualification is that some languages place upper limits on the length of initial and/or final clusters created by (13), as in Turkish which does not permit syllable-initial consonant clusters in native words.14 The "left-precedence" or "onset-first" principle rendered explicit by the precedence given to (13b) over (13c) is widely observed in languages. Sievers (1881) had already noticed the widespread tendency toward syllabifications of the form V.CV, where C is a single consonant or a "permissible initial cluster." This observation was generalized by later linguists as the Maximal Onset Principle, which states that intervocalic clusters are normally divided in such a way as to maximize syllable onsets (see Pulgram 1970; Bell 1977; Selkirk 1982, and others). This principle applies as a strong cross-linguistic tendency just as long as the result is consistent with the CSP and with any additional language-particular restrictions on syllable length or syllable composition.15 In accordance with this principle, template is syllabified as follows: (14)
[Diagram: the syllabification of template into the two syllables tem and pleyt, each associated to a syllable node, with the sonority contour of each syllable plotted against the scale lines syllabic, vocoid, approximant, and sonorant.]
As a consequence of the Core Syllabification Principle, intervocalic clusters will be syllabified in such a way as to both maximize the length of syllable onsets and increase the difference in sonority between their first and last members. This follows from (13b), which due to its iterative nature will continue to adjoin consonants to the initial cluster as long as each new one added is lower in sonority than the previous one. A second consequence is that a syllable which is nonfinal in the domain of core syllabification will have a minimal decay in sonority, since less sonorous consonants to its right will normally have been syllabified into the following syllable by the prior application of (13b).16 Both points are illustrated in the syllabification of template, above, in which the Core Syllabification Principle requires p to syllabify rightward rather than leftward, giving the first syllable a relatively small decay in sonority at its end and the second a relatively sharp rise at its beginning. A third consequence is that syllables which are final in the domain of core syllabification should tend to show a maximal decay in sonority, since they
do not compete for consonants with a syllable to the right. The right margins of final syllables should thus tend to resemble the "mirror image" of the initial margins of initial syllables as far as their sonority profiles are concerned. This prediction regarding codas in final syllables, though frequently true, is not exceptionless, however. In many languages the preferred syllable type is open, and closed syllables tend to be characterized by small rather than large drops in sonority finally as well as medially. When languages allow both sonorants and obstruents in final position, the set of obstruents which can occur there is frequently smaller than the set of permissible sonorants. It seems that the universally preferred syllable type tends to resemble the simple, open CV syllable as closely as possible; and a syllable approximates this type more closely to the extent that it declines less in sonority at its end. Thus a better characterization of the sonority cycle principle is that the preferred syllable type shows a sonority profile that rises maximally toward the peak and falls minimally towards the end,
proceeding from left to right. This principle expresses a valid cross-linguistic tendency, but does not exclude the presence of less preferred core syllable types in given languages. For example, many languages tolerate V-initial syllables, which begin with no rise in sonority at all. However, such syllables are normally restricted to word- or morpheme-initial position: in internal position, hiatus across syllable boundaries is very commonly eliminated in the core phonology by such processes as glide formation and vowel deletion. Similarly, many languages tolerate syllable types with abrupt drops in sonority at their end, or indeed syllables that have fairly complex final clusters that do not obey sonority sequencing restrictions at all. However, such clusters are generally restricted to final position in morphologically-defined domains. For example, English has a high tolerance for syllables ending in obstruent clusters (Dewey 1923; Roberts 1965). Within roots and level 1 stems, however, they are restricted to final position; internally, the inventory of syllable finals is much more restricted, and strongly favors sonorants over obstruents (Borowsky 1986).17 The suspension of normal sonority constraints in peripheral position in the domain of syllabification can be formally characterized in terms of the notion of extraprosodicity as governed by the Peripherality Condition (Hayes 1985; Harris 1983), but may have a deeper explanation in the observation that peripheral segments are not subject to competing syllable divisions, and thus cannot give rise to alternative syllable parsings. Viewed in this way, the sonority cycle provides a rationale for the ordering of (13b) over (13c) in the statement of the Core Syllabification Principle. This ordering, as already noted, reinforces the tendency of syllables to show a gradual decay in sonority toward their end. We will see shortly that a more precise characterization of the notion of the sonority cycle, implemented in terms of the Dispersion Principle, allows a significant simplification of the Core Syllabification Principle.
The Core Syllabification Principle is defined within the domain of core syllabification, which is fixed on language-particular grounds. This domain is the morphologically-determined portion of a form to which the core syllabification rules apply. Within the domain, the core syllabification rules and principles apply recursively to the output of each phonological or morphological operation. Thus in German, the domain of core syllabification can be identified as the morpheme (Laeufer 1985), while in English, as just noted, it is most likely identical to the stem formed by the level 1 morphology. As noted, the CSP operates only within the margin of freedom allowed by a particular language. Thus if a language does not allow initial clusters, an intervocalic cluster will usually be heterosyllabic, even if the second member of the cluster is higher in sonority than the first. Examples are Turkish and Klamath, whose syllable-sensitive phonologies always treat the first of two intervocalic consonants as closing the first syllable, regardless of the sonority profile of the cluster (Clements and Keyser 1983). Another type of constraint is illustrated in the Germanic languages, where it is widely observed that a short stressed vowel attracts a following consonant into its syllable (Murray and Vennemann 1983; Laeufer 1985). This principle has a counterpart in the English rule of Medial Ambisyllabification (Kahn 1980), which applies without regard to the general preferences expressed by the CSP. Further, some languages systematically syllabify vowels and glides together to form diphthongs, even when the following segment is a vowel. Thus in English, the glide [y] in biology [bay'alaji] is syllabified with the first syllable, not with the second, as is evidenced by the failure of the first vowel to reduce to schwa by Initial Destressing. These are common ways in which language-particular rules may take precedence over the CSP. These rules themselves, it should be noted, are not arbitrary but reflect independently observed tendencies, such as the widespread dispreference for tautosyllabic clusters, or the preference for stressed syllables to be heavy.
17.5.2 The Dispersion Principle
The Core Syllabification Principle expresses a generalization about the way sequences of segments are commonly organized into syllables. It classifies syllables into two types, those that conform to the CSP, and those that violate it by presenting sonority plateaus or sonority reversals. Most frequently, if a language has syllables that violate the CSP it also has syllables that conform to it. Accordingly we will call syllables that conform to the CSP "unmarked" syllables and those that violate it "marked" syllables. Apart from the two-way distinction between unmarked and marked syllable types, the CSP does not have anything to say about the relative complexity of syllables. This topic is treated in this section. Our basic claim will be that syllables are simple just to the extent that they conform to the optimal syllable as defined
by the sonority cycle. Thus, the simplest syllable is one with the maximal and most evenly-distributed rise in sonority at the beginning and the minimal drop in sonority (in the limit case, none at all) at the end. Syllables are increasingly complex to the extent that they depart from this preferred profile. In order to characterize "degree of distance from the optimal syllable" in this sense, we will first define a measure of dispersion in sonority, and then formulate the Dispersion Principle in terms of it. This is the principle that will serve as the basis for ranking syllable types in terms of relative complexity. As stated here, it is defined only upon unmarked syllables, that is, those that show a steady rise in sonority from the margins to the peak; other ("marked") types of syllables must be ranked by a separate method of evaluation involving an extension of the complexity metric to be given below. In order to state the Dispersion Principle in the most revealing form, it proves convenient to make use of the demisyllable, a notion drawn from the work of Fujimura and his collaborators.18 I will begin by defining this term as it is used here. A syllable is divided into two overlapping parts in which the syllable peak belongs to both; each of these parts is termed a demisyllable. In the case of syllables beginning or ending in short vowels, one demisyllable is the short vowel itself. Thus, for example, the syllable [kran] consists of the demisyllables [kra, an], the syllable [spawl] of [spa, awl], the syllable [pa] of [pa, a], the syllable [ap] of [a, ap], and so forth. The demisyllable can be defined more formally as follows:19 (15)
A demisyllable is a maximal sequence of tautosyllabic segments of the form Cm...CnV or VCm...Cn, where n ≥ m ≥ 0.
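As a concrete illustration of (15) (a sketch of mine, not the author's), a syllable whose peak position is known can be split into its two overlapping demisyllables as follows; the peak index is supplied by hand rather than computed.

```python
# Split a syllable into its overlapping initial and final demisyllables (definition (15)).
# Illustrative only; the position of the peak is given, not derived.
def demisyllables(segments, peak):
    initial = segments[:peak + 1]   # Cm ... Cn V  (everything up to and including the peak)
    final = segments[peak:]         # V Cm ... Cn  (the peak and everything after it)
    return initial, final

print(demisyllables(list("kran"), 2))    # (['k', 'r', 'a'], ['a', 'n'])
print(demisyllables(list("spawl"), 2))   # (['s', 'p', 'a'], ['a', 'w', 'l'])
```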
The idea underlying the use of the demisyllable is that the sonority profile of the first part of the syllable is independent of the sonority profile of the second part. That is, there are no dependencies holding between the two parts of the syllable as far as sonority is concerned. Thus the attribute "dispersion in sonority" is most appropriately defined over the demisyllable. If we now restate the principle of the sonority cycle in terms of demisyllables, and consider only unmarked demisyllables, we will say that the initial demisyllable maximizes the contrast in sonority among its members, while the final demisyllable minimizes it. The contrast in sonority between any two segments in a demisyllable can be stated, as a first approximation, as an integer d designating the distance in sonority rank between them. For example, given the sonority scale O < N < L < G < V the distance in sonority rank between N and V is 3, regardless of their relative position in a demisyllable. The notion "dispersion in sonority" can be stated in terms of a measure of dispersion, D, of the distances in sonority rank d between the various pairs of segments within a demisyllable. D characterizes demisyllables in terms of the
extent to which the sonority distances between each pair of segments are maximized: the value for D is lower to the extent that sonority distances are maximal and evenly distributed, and higher to the extent that they are less maximal or less evenly distributed. It can be defined by the following equation, which is used in physics in the computation of forces in potential fields, and is proposed by Liljencrants and Lindblom (1972) to characterize the perceptual distance between vowels in a vowel system.
(16) D = Σ 1/dᵢ², i = 1, ..., m
Here, dᵢ is the distance in sonority rank between the ith pair of segments in the demisyllable (including all nonadjacent pairs), and m is the number of pairs in the demisyllable, equal to n(n−1)/2, where n is the number of segments. It states that D, the dispersion in sonority within a demisyllable, varies according to the sum of the inverse of the squared values of the sonority distances between the members of each pair of segments within it. Assuming the sonority scale in (6) and (7), this gives the following values of D for simple CV and VC demisyllables: (17)
OV, VO = 0.06
NV, VN = 0.11
LV, VL = 0.25
GV, VG = 1.00
For CCV and VCC demisyllables, we have the following: (18)
OLV, VLO = 0.56
ONV, OGV, VNO, VGO = 1.17
NLV, NGV, VLN, VGN = 1.36
LGV, VGL = 2.25
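The values in (17) and (18) can be recomputed directly from (16); the following sketch is mine rather than the author's, and assumes the numerical ranks O=0, N=1, L=2, G=3, V=4.

```python
from itertools import combinations

RANK = {'O': 0, 'N': 1, 'L': 2, 'G': 3, 'V': 4}   # sonority scale O < N < L < G < V

def dispersion(demisyllable):
    """Equation (16): D = sum of 1/d**2 over every pair of segments, nonadjacent pairs included.
    Defined here only for 'unmarked' demisyllables, whose members all differ in rank."""
    ranks = [RANK[seg] for seg in demisyllable]
    return sum(1 / (a - b) ** 2 for a, b in combinations(ranks, 2))

for ds in ('OV', 'NV', 'LV', 'GV', 'OLV', 'ONV', 'NLV', 'LGV'):
    print(ds, round(dispersion(ds), 2))
# OV 0.06, NV 0.11, LV 0.25, GV 1.0  -- the values in (17)
# OLV 0.56, ONV 1.17, NLV 1.36, LGV 2.25  -- the values in (18)
```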
We observe that in terms of the sonority cycle, initial demisyllables with low values for D are those that show an optimal sonority profile, i.e. a sharp and steady rise in sonority, while final demisyllables with high values for D show the best profile, i.e. a gradual drop in sonority. We may accordingly state the Dispersion Principle as follows: (19)
Dispersion Principle:
a. The preferred initial demisyllable minimizes D.
b. The preferred final demisyllable maximizes D.
It can be noted in passing that other ways of defining the value of D are possible in principle. For example, it might be more appropriate to restate (16) over the sum of sonority distances for adjacent pairs of segments only. As it happens, this
Table 17.1 Complexity rankings for demisyllables of two and three members, based on the sonority scale O < N < L < G < V

a. Two-member demisyllables:
                            D      C
   i.  initial:  OV         0.06   1
                 NV         0.11   2
                 LV         0.25   3
                 GV         1.00   4
   ii. final:    VO         0.06   4
                 VN         0.11   3
                 VL         0.25   2
                 VG         1.00   1
b. Three-member demisyllables:
   i.  initial:  OLV        0.56   1
                 ONV, OGV   1.17   2
                 NLV, NGV   1.36   3
                 LGV        2.25   4
   ii. final:    VLO        0.56   4
                 VNO, VGO   1.17   3
                 VLN, VGN   1.36   2
                 VGL        2.25   1
version of (16) gives only slightly different values of D, since the value of 1/d² is always very small for nonadjacent pairs, and proves to yield no differences in actual demisyllable rankings. Other possible versions, involving some simple summation of the distance between members instead of the inverse of the square, prove not to yield the desired complexity rankings, and need not be discussed here. We may now define a Complexity Metric making use of the Dispersion Principle as stated in (19). This metric defines complexity rankings in terms of values of D, and states separate conditions for initial and final demisyllables. (20)
Complexity Metric: For demisyllables of length l,
a. the complexity ranking, C, of an initial demisyllable increases as its ranking in terms of D increases;
b. the complexity ranking, C, of a final demisyllable increases as its ranking in terms of D decreases.
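A small sketch of mine (not the author's) of how (20) converts D values into complexity ranks C within a class of same-length demisyllables; it reproduces two of the rankings of table 17.1 below. The dispersion function simply repeats the computation of (16).

```python
from itertools import combinations

RANK = {'O': 0, 'N': 1, 'L': 2, 'G': 3, 'V': 4}

def dispersion(ds):
    r = [RANK[s] for s in ds]
    return round(sum(1 / (a - b) ** 2 for a, b in combinations(r, 2)), 4)

def complexity_ranks(group, position):
    """Complexity Metric (20): within same-length demisyllables, C follows the ranking by D,
    ascending for initial demisyllables and descending for final ones; ties share a rank."""
    values = sorted({dispersion(ds) for ds in group}, reverse=(position == 'final'))
    return {ds: values.index(dispersion(ds)) + 1 for ds in group}

print(complexity_ranks(['OLV', 'ONV', 'OGV', 'NLV', 'NGV', 'LGV'], 'initial'))
# {'OLV': 1, 'ONV': 2, 'OGV': 2, 'NLV': 3, 'NGV': 3, 'LGV': 4}   (table 17.1b.i)
print(complexity_ranks(['VO', 'VN', 'VL', 'VG'], 'final'))
# {'VO': 4, 'VN': 3, 'VL': 2, 'VG': 1}                            (table 17.1a.ii)
```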
In the case of initial demisyllables of a given length, this metric will assign the rank 1 to the demisyllable with the lowest value of D, the rank 2 to the next highest, and so forth. The demisyllable OV, for example, has the lowest value for D, and
Table 17.2 Complexity rankings for one-member demisyllables (compared to two-member demisyllables)

                 D           C
a. initial:  V   undefined   5
b. final:    V   undefined   0
therefore the lowest complexity rank (1); NV has the second lowest value for D, and thus the second lowest complexity rank (2); and so forth. Two-member demisyllables fall into four degrees of complexity, as do three-member demisyllables. Complexity rankings for two- and three-member demisyllables are shown in table 17.1. It should be noticed that C is not proportional to D itself, but rather to the ranking defined by D. (20) does not assign a value for C to one-member demisyllables (V), which nevertheless vary in complexity according to whether they constitute initial or final demisyllables just as longer ones do. We will therefore extend our measure of complexity in a natural way to account for these. An initial one-member demisyllable V must be regarded as highly complex as it fails to show any rise in sonority whatsoever. It must therefore be regarded as more complex than the most complex two-member initial demisyllable GV, which shows a slight (i.e. one step) rise in sonority. Since GV has a complexity rank of 4, we will assign the initial demisyllable V a complexity rank of 5. A final one-member demisyllable V, on the other hand, must be regarded as maximally simple since it conforms exactly to the pattern of the optimal CV syllable, showing no decline in sonority at all. We will give this demisyllable the complexity rank of 0, one step lower than the next most favored final demisyllable, VG. Thus we have the additional rankings in table 17.2 which rank one-member demisyllables with respect to two-member demisyllables. Four-member demisyllables fall into one of three complexity ranks, as shown in table 17.3. The longest demisyllables that can be evaluated by this procedure, assuming the scale O < N < L < G < V, are the singleton five-member sets, ONLGV and VGLNO, for which D = 5.03. The same system extends to demisyllables with syllabic consonants as peaks. Recall that all syllabic consonants have a sonority ranking of 1 more than their nonsyllabic counterparts, as was shown in (6) and (7). Thus for the case of demisyllables of length 2, for instance, we have the rankings in table 17.4.20 The complexity rankings in tables 17.1-17.4 define a hierarchy over demisyllables. We may now state the following implications for core phonology, which hold at the level resulting from initial syllabification, which I will call L(IS).
Table 17.3 Complexity rankings for four-member demisyllables

                                    D      C
a. initial demisyllables:  ONGV         2.53   1
                           OLGV, ONLV   2.67   2
                           NLGV         3.61   3
b. final demisyllables:    VGNO         2.53   3
                           VGLO, VLNO   2.67   2
                           VGLN         3.61   1
The implications are stated only over demisyllables of the same type, where the type of a demisyllable depends on (i) whether it is an initial or final demisyllable, and (ii) what the segment type of its peak is (vowel, syllabic liquid, etc.). In addition, the Complexity Hierarchy is stated only over demisyllables of the same length (where V-demisyllables count as if they were of length 2, for the reasons explained just above). A separate Length Hierarchy is stated over demisyllables of different lengths. These two statements together form the Complexity and Length Hierarchies, stated in (21): (21)
a. The Complexity Hierarchy: For any given type t and length l, the presence in L(IS) of a demisyllable of complexity rank n implies the presence of a demisyllable of complexity rank n−1.
b. The Length Hierarchy: For any given type t, the presence in L(IS) of a demisyllable of length l (l > 2) implies the presence of a demisyllable of length l−1.
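The implications in (21) lend themselves to a mechanical check over a hypothesized L(IS) inventory. The sketch below is mine: the demisyllable "type" is collapsed to initial versus final for brevity, and the complexity ranks are supplied by hand from tables 17.1 and 17.2.

```python
# Check an L(IS) demisyllable inventory against the Complexity (21a) and Length (21b) Hierarchies.
# Illustrative only: types reduced to 'initial'/'final', C ranks given by hand.
def hierarchy_violations(inventory, rank):
    found = []
    for typ, group in inventory.items():
        for ds in group:
            # (21a): rank n implies rank n-1 among demisyllables of the same type and length
            n = rank[typ, ds]
            if n > 1 and not any(rank[typ, x] == n - 1 for x in group if len(x) == len(ds)):
                found.append(('21a', typ, ds))
            # (21b): length l > 2 implies length l-1 for the same type
            if len(ds) > 2 and not any(len(x) == len(ds) - 1 for x in group):
                found.append(('21b', typ, ds))
    return found

inventory = {'initial': {'OV', 'NV', 'LV', 'OLV'}, 'final': {'V', 'VN'}}
rank = {('initial', 'OV'): 1, ('initial', 'NV'): 2, ('initial', 'LV'): 3,
        ('initial', 'OLV'): 1, ('final', 'V'): 0, ('final', 'VN'): 3}
print(hierarchy_violations(inventory, rank))
# [('21a', 'final', 'VN')] -- a rank-3 VN with no rank-2 VL beside it violates (21a)
```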
By the Length Hierarchy (21b), for example, the presence of a CCV demisyllable in L(IS) implies the presence of CV, and so forth for longer demisyllables. The Length Hierarchy does not project a ranking for V-demisyllables, since as just mentioned these count as representing length 2; instead, V-demisyllables are ranked with respect to others by (21a), which treats a final V-demisyllable as simpler than any VC demisyllable, and an initial V-demisyllable as more complex than any CV demisyllable, by table 17.2. Notice that it is only by placing V-demisyllables under the scope of the Complexity Hierarchy in this way that we are able to offer a principled account of the fact that V-demisyllables are more complex than CV initial demisyllables, but simpler than VC final demisyllables (rather than, for example, the contrary); if we were to rank them under the Length Hierarchy instead, this asymmetry in behavior would have to be accounted for in terms of an arbitrary stipulation.
Table 17.4 Complexity rankings for demisyllables with syllabic consonants as peaks
a. with syllabic liquids as peaks
                    D      C
   i.  initial:  OL     0.11   1
                 NL     0.25   2
                 LL     1.00   3
   ii. final:    LO     0.11   3
                 LN     0.25   2
                 LL     1.00   1
b. with syllabic nasals as peaks
   i.  initial:  ON     0.25   1
                 NN     1.00   2
   ii. final:    NO     0.25   2
                 NN     1.00   1
These principles allow us to characterize a language as more or less complex according to the following properties of demisyllables occurring at L(IS): (22)
a. the maximal value of n in (21a);
b. the maximal value of l in (21b);
c. the presence of "marked" demisyllables (those violating the CSP).
The Complexity/Length Hierarchy (21) represents a claim about the organization of phonological systems at the level of core syllabification. It maintains that core syllabification rules do not create complex types unless they create the more simple syllable types. Surface exceptions to (21) arise as a result of segmental rules creating new cluster types and later syllabification rules applying after the level of core phonology. Both types of surface exception can be illustrated from French. In French, we find surface syllables of several types. In the first place, we find the unmarked demisyllable types OLV (drap 'sheet', vrai 'true'), OGV (dieu 'god', chouette 'owl'), NGV (mieux 'best', nuage 'cloud'), and LGV (rien 'nothing', lieu 'place', rouan 'roan', lui 'him'), as well as a full range of CV demisyllables. In the second place, we find demisyllable types such as OOV (style 'style', sphère 'sphere', psychose 'psychosis') and more rarely ONV, NNV (pneu 'tire', mnémonique 'mnemonic'); in addition we find a few s-initial CCCV demisyllables, such as spleen 'spleen', strict 'strict'. The second group can be identified as nonbasic syllable types due to the fact that they are restricted to initial
position in the syllabification domain: thus we do not find tautosyllabic sC clusters internally in morphemes and simple stems (Lowenstamm 1981). We need consider only the first set, therefore, all of which may occur word-internally as well as word-initially and which are accordingly good candidates for core syllables at the level L(IS). Among unmarked demisyllables of length 3, then, we find OLV and three types of CGV syllables: OGV, NGV, and LGV. Missing are ONV and NLV. Since the presence of LGV (of complexity rank 4) implies the presence of ONV and NLV (of complexity ranks 2 and 3, respectively), we have an apparent violation of (21). It remains to determine, however, whether CGV demisyllables are actually present in L(IS). Glides and vowels are underlyingly contrastive in French, but this contrast is restricted to words like abbaye [abei] 'abbey', abeille [abey] 'bee', where the glide is final; we find no comparable contrasts in prevocalic position. For example, we find cahot [kao] 'jolt' and caillot [kayo] 'clot' but no contrastive word [kaio]. Surface GV syllables ordinarily derive from underlying VV sequences, since such syllables behave as vowel-initial with respect to rules that distinguish consonants and vowels. Thus we find les [lez] amis 'the friends' contrasting with les [le] copains 'the pals', illustrating the fact that the final [z] of les is deleted before consonants; [z] is retained, however, before the surface glide [y] in les yeux [lezyo], showing that this must be a vowel at the time z-deletion applies (see Clements and Keyser 1983, 96-99 for fuller discussion). We conclude that GV syllables do not occur at the level L(IS) and therefore that initial demisyllables of length 2 are restricted to a maximum complexity of 3 at this level. (Note, however, that a small number of loanwords allow initial underlying glides, such as yod 'yod', whisky 'whisky'; cf. les [le] yods, les [le]
whiskys.)
By the principle of resolvability (Greenberg 1978: 250; Clements and Keyser 1983: 47-8), the presence of a tautosyllabic cluster ABC implies the independent occurrence of tautosyllabic AB and BC. If CGV demisyllables were created by core syllable rules at L(IS), we would have a violation of this widely-observed principle, since as just shown GV does not occur independently at this level. As CGV syllables do not contrast with CVV syllables, however, we may eliminate them from the level of initial syllabification L(IS) and derive them from the CVV demisyllables by the rule of Glide Formation, which turns high vowels into glides before vowels. This rule accounts not only for the presence of (C)GV in monosyllabic roots, but for alternations such as manie [mani] 'I handle' vs. manier [manye] 'to handle', or avoue [avu] 'he admits' vs. avouer [avwe] 'to admit' (see e.g. Dell 1980; Noske 1982). This leads us to the following analysis of core demisyllables in French. The maximal complexity for initial demisyllables of length 2 is 3 (with a few word-initial exceptions in nonnative words, as mentioned) and the maximal complexity for demisyllables of length 3 is 1, the default value for this case. Thus initial
syllabification creates only OV, NV, LV, and OLV, consistently with (21). CGV demisyllables arise through the rule of Glide Formation, which applies obligatorily in initial and postvocalic position and optionally postconsonantally. For some, but not all, speakers it is also obligatory when defined entirely within a single morpheme (for such speakers lieu 'place' is always [lyo], never [lie]). The output of Glide Formation is fully syllabified, but respects the length constraints which continue to operate through the core phonology: thus it cannot create CCGV demisyllables, and is blocked in words like plier 'bend', crier 'cry', and grief 'grievance', which remain bisyllabic. Interestingly, for some speakers Glide Formation can apply in s-initial words like skier [skye] 'to ski'. We may assume that for these speakers s-initial clusters are created by a post-core rule syllabifying initial s with a following consonant; at this point the core syllable constraints are no longer operative. For other speakers, this rule belongs to the core phonology. We see, then, that surface exceptions to (21) may not be exceptions at the level of initial syllabification, at which (21) is defined. In French, surface exceptions arise in two ways: through the creation of new sequence types by the operation of Glide Formation in the core phonology, which are resyllabified subject to the length restrictions, and through the creation of new syllable types (such as s-initial clusters) by syllabification rules applying subsequently in the derivation, perhaps in the post-core phonology. This analysis directly captures the generalization that the length condition on the output of Glide Formation is identical to the length condition holding on underlying syllables. More generally, it supports our claim that sonority constraints are most suitably defined in core phonology, rather than in surface structure (section 17.3.1). Let us turn finally to the status of "marked" demisyllables containing violations of the sonority cycle, in the form of sonority "plateaus" or "reversals" such as OOV, NOV, or LOV. Such structures are not uncommon at the surface-phonetic level in languages, as we saw in section 17.3.1, and may arise through core or post-core syllabification processes as we have just seen in French. The essential observation here is that such sonority violations are usually restricted to the periphery of the syllabification domain, where they do not give rise to problems of syllable division. A language that exhibits LOV demisyllables word-initially, for instance, does not usually tolerate them word-internally, just as languages that permit VOO demisyllables word-finally do not usually allow them in non-final syllables. This observation does not require any new principles, since it follows directly from the CSP. Word-initially, the CSP syllabifies a sequence like LOV as L-OV, where the L remains extrasyllabic. Some languages may then have special rules allowing this segment to be incorporated into the syllable, while others may require it to be deleted. Word-internally, however, the sequence VLOV will be syllabified VL-OV by the CSP; this will be true even if the language in question
has a rule creating LOV demisyllables word-initially, under the assumption that syllabification rules restricted to domain edges apply after rules that implement the CSP. Therefore LOV demisyllables will not be created word-internally, except in the highly unusual case in which a language has a rule overriding the CSP in just this context, for example by carrying out a resyllabification. There is a straightforward way to determine the relative complexity of demisyllables containing sonority "plateaus" and "reversals," which have not so far been integrated into the evaluation system. The basic observation is that the deviance of "marked" demisyllables is proportional to their distance from "unmarked" demisyllables. Thus sonority reversals are more complex than sonority plateaus, and the complexity of sonority reversals increases in proportion to the extent of the reversal: e.g. NOV is more complex than OOV, LOV than NOV, GOV than LOV, etc. We may assume an appropriate extension of the Complexity Metric to cover these cases. This section has proposed a formal procedure for quantifying the relative complexity of demisyllables of various types - and hence (though derivatively) of syllables. We need not attribute such computation to the explicit knowledge of native speakers in any sense. Rather, the relationships we have sought to bring out are properties of the representations as such, and can presumably be apprehended by speakers without carrying out conscious mathematical calculations - just as we can detect whether billiard balls are evenly dispersed on a billiard table without doing computations on a pocket calculator.
17.5.3 The Sequential Markedness Principle
Certain sequencing constraints holding within syllables cannot be accounted for by the theory developed so far. Let us consider cases in which place of articulation seems to play a role. Greenberg (1978) observes what he terms the "law of the final dental-alveolar," which he formulates as follows: "every language [in the sample] with final clusters contains at least one cluster with a final obstruent in the dental-alveolar region" (p. 268). That is, if a language allows VCC demisyllables to occur in final position, at least one of these is of the form VCT, where T represents a dental or alveolar obstruent. Examples given by Greenberg include Classical Greek, with the three final clusters ps, ks and nks, Latin whose final clusters all end in s or r, Balti with ks, rs, ws, and Maasai with only rn, rt, and rd. A similar implication holds of initial demisyllables: as Greenberg notes, "every language [in the sample] with initial clusters contains at least one cluster with an initial consonant in the dental-alveolar region" (269). As an example he cites Chiricahua Apache, in which the only initial clusters are st and sd. Some linguists have suggested, on the basis of observations similar to these, that
coronal segments should be assigned a special rank of their own on the sonority scale. This would allow coronals to be formally treated as different in sonority from segments formed at other places of articulation. Closer consideration shows, however, that this approach weakens the notion of sonority to an undesirable degree, and does not explain the special status of anterior coronals (dentals and alveolars) compared to posterior coronals (palato-alveolars). One reason not to assign coronals a special place of their own on the sonority scale is that the distinction between coronal and noncoronal segments of the same major class does not correspond to the required difference in perceptibility, unlike the major class features which define the scale given in (6) and (7). For example, [s] is a more salient segment than [f] or [θ] in terms of intensity and loudness, and is thus presumably more sonorous in this view, but nevertheless occurs peripherally to [p] and [k] in initial clusters like English spit, skit and in final clusters like lapse, tax where the theory requires it to be less sonorous. Nor, in particular, does such an approach help explain why [s] and [z] frequently occur peripherally to fricatives at other places of articulation, as in English sphere, Jeeves or Dutch school [sxo:l], aardigst [...xst] 'nicest'. If we are to maintain that coronals are less sonorant than noncoronals on the basis of the patterning of [s], we must abandon the claim that sonority is related to increased perceptibility, which seems otherwise correct. Moreover, it is difficult to find any general position for coronals that would give a correct general account of their exceptional freedom of occurrence. On the one hand, to handle initial clusters like sp, sf, sk, sx or final clusters like pt, kt, ps, fs, ks, xs we would have to assign the coronals a lower sonority rank than noncoronals, as we have just seen. But there are considerations arguing for just the opposite analysis. As Steriade (1982) observes, we may account for the common exclusion of the initial clusters tl, dl in languages otherwise permitting OL clusters freely by the minimal distance principle if we claim that t, d have a higher sonority rank than p, b, k, g: under this assumption, t, d are "closer" in sonority to l than are the noncoronal stops, and hence can be excluded by a minimal distance constraint. And as Selkirk notes (1984), we must assign coronals a higher rank than noncoronals to account for languages such as Spanish and Italian in which only sonorants and s may close the syllable, in order to designate this set of segments as a natural class on the sonority scale. But such inconsistency in the place of coronals argues strongly against this approach. Furthermore, it seems that one and the same language may treat coronals inconsistently. In English, as we have seen, coronals typically pattern peripherally to noncoronals in initial and final obstruent clusters, a fact which suggests that they have a lower rank on the sonority scale. This is supported by Stuart Milliken's observation (personal communication) that in single morphemes, an obstruent may follow an oral stop only if it is a coronal, regardless of whether the
cluster is intervocalic or final: thus we have intervocalic clusters such as those in chapter, capsule, abdomen, pretzel, factor, and pixel beside final clusters such as
those in rapt, lapse, ritz, fact, tax, but no words with stop-initial clusters ending in noncoronals (*ratp, etc.).21 If we regard coronals as ranking lower in sonority than noncoronals, this will follow from the CSP and the Syllable Contact Law (section 17.2). On the other hand, other facts in English argue that coronals are higher ranking than noncoronals. For example, [s] and [z] are the only fricatives that can precede a noncoronal oral stop in a morpheme: whisper, whisker, lisp, risk but *whifper, whifker, lifp, rifk; this can be explained under the Syllable Contact Law and the Minimal Distance Principle only if they are higher in sonority than the noncoronal fricatives. Moreover, English shares the property of many languages mentioned above, according to which t, d are excluded in initial clusters before l; as pointed out, this also follows from the Minimal Distance Principle if coronal stops are higher ranking than noncoronal stops. For these reasons it seems undesirable to introduce further subdivisions of the sonority scale to accommodate distinctions in place of articulation. Examining the relevant facts more closely, it seems that another principle may be better able to provide an explanation. The observations made so far show that in initial and final clusters, anterior coronals have a freer privilege of occurrence than other consonants do; they are often the only segments able to occur as the first or second member of clusters. It would seem reasonable to relate this fact to an independent property of anterior coronals, which is that they are formed at the least marked place of articulation, by most markedness criteria (see Stevens and Keyser 1987 for recent discussion). The complexity of any sequence of segments can be considered a function both of its length and of the individual segments that compose it. Thus although a two-member cluster is more complex than a one-member cluster, it is less complex if it contains an anterior coronal than if it contains some other consonant, all else being equal. Normal markedness principles, therefore, lead us to expect exactly the pattern of preference for anterior coronals that we have observed. We can make this observation explicit in terms of the following principle, which presumably does not need to be stated as an axiom in grammatical theory as it should follow from an adequate, completely elaborated theory of markedness: (23)
Sequential Markedness Principle: For any two segments A and B and any given context X_Y, if A is simpler than B, then XAY is simpler than XBY.
Thus pt is simpler than pk by virtue of the fact that t is simpler than k, and so forth. This principle extends to most of the observations we have made above. Clearly, however, its scope is much broader. Beside explaining the preference for
sk over fk, for example, it also explains the general preference for clusters containing s as opposed to all other coronal fricatives: s is the least marked fricative. This explains why initial clusters in English include sn, sm but not fn, fm, θm, θn, and only marginally šn, šm. Since voiceless fricatives are less marked than voiced fricatives, it also explains the presence of initial fr, fl, sn beside the absence of vr, vl, zn, etc. Thus we need a principle like (23) in any case. But once we have it, it accounts for the preference for dentals and alveolars without the need for further elaboration of the theory of sonority.22
17.6 Theoretical results
Let us consider some of the general results and cross-linguistic predictions of our approach to sonority. These are taken up under five headings: 17.6.1 Sonority Sequencing Restrictions; 17.6.2 the Maximal Onset Principle; 17.6.3 Minimal Distance Constraints; 17.6.4 the Syllable Contact Law; and 17.6.5 Core Syllable Typology.
17.6.1 Sonority sequencing restrictions
As has already been pointed out, the account of sonority given above is based in the first instance on cross-linguistic generalizations of the sort noted by Greenberg (1978). These generalizations strongly support the sonority scale O < N < L < G < V. I summarize Greenberg's main results in (24)-(26), below.23 These examples consist of implicational statements of the general form, "if a language has property A, it also has property B." We can symbolize such statements by means of expressions of the form "A → B," or "A implies B." Statements of this type are often understood as providing an indication of the relative markedness of the two properties in question, the unmarked (or less marked) value appearing to the right of the arrow. I present Greenberg's results under three general headings, which subsume Greenberg's implicational statements. (These headings are my own, not Greenberg's.) Under (24) I have grouped a number of implications supporting the proposition that the unmarked order of segment types within an initial demisyllable is ONLGV, and within a final demisyllable VGLNO. This proposition follows from the sonority scale O < N < L < G < V and from the CSP (13), which plays the important role of distinguishing between "unmarked" and "marked" demisyllable types. For example, since "marked" LOV demisyllables are not formed by the CSP, they require the complexity of an extra syllabification rule, and are furthermore ranked as several degrees more complex than "unmarked" demisyllables such as OLV created by the CSP, by the extension of the complexity metric suggested at the end of section 17.5.2. As (24) shows, Greenberg's results strongly support this proposition, in the sense that nine of his statements are entailed by it and none are inconsistent with it. (I retain Greenberg's numbering).
(24)
The unmarked order of segment types in a demisyllable is ONLGV or VGLNO:
(17) LOV → OLV
(18) VOL → VLO
(19) GOV → OGV; VOG → VGO
(24) LNV → NLV
(25) VNL, VNN, VLL → VLN
(36') VNN → VNO
Under (25) I have grouped further statements relating to the proposition that segments within the initial demisyllable tend to be equally and maximally distributed in sonority. Two implicational statements are entailed by this proposition and none contradict it. This proposition does not follow from the Core Syllabification Principle itself: in particular, the three demisyllable types mentioned in these statements are equally consistent with this principle, which establishes no ranking among them. This proposition does, on the other hand, follow from the Dispersion Principle (19), as we have already seen. (25)
Segments within the initial demisyllable tend to be equally and maximally distributed in sonority:
(33) NLV → OLV
(37) ONV → OLV
The converse of this proposition, that segments in final demisyllables tend to be minimally or unequally distributed in sonority, is contradicted by one statement, (34), according to which VLN → VLO. How is this discrepancy to be explained? We have already observed the common operation of rules that append extrasyllabic segments to the ends of syllabification domains. Such rules commonly create highly marked clusters, as in English, German, and many of the languages surveyed by Greenberg in which initial and final obstruent clusters occur with fairly high frequency. Segments appended by these rules are often coronal obstruents, a fact which reflects the unmarked status of these segments (cf. the Sequential Markedness Principle), as well as the fact that obstruents are just those segments that will least often create violations of the CSP when appended to the beginning or end of a domain. Since these properties follow from our other principles, however, it is not necessary to introduce new principles to deal with them. We therefore consider the preference for VLO demisyllables over VLN demisyllables in word-final position as reflecting the operation of rules that append extra-syllabic segments, preferentially obstruents, to the ends of domains. We expect this preference not to obtain in nonfinal syllables. Finally, in (26) I give a number of statements supporting the view that fricatives and stops should be considered equal in rank, just as is claimed by the scale O < N < L < G < V. In these statements, "S" is to be read "stop" and "F" as "fricative". The general proposition supported by Greenberg's results here is that
sequences differing in their specification for [continuant] are preferred to sequences agreeing in this specification. (26)
Contrast in continuancy is favored over its absence:
(7) SSV → FSV, SFV
(8) VSS → VFS, VSF
(9) FFV → FSV, SFV
(10) VFF → VFS, VSF
This same principle may be able to account for the widely-observed preference for demisyllables of the form [trV] or [drV] over demisyllables of the form [tlV] or [dlV]. In the first of these, the consonant cluster contrasts in terms of the feature [continuant], while in the second it does not, assuming the correctness of our earlier assumption that laterals are [−continuant]. Interestingly, Greenberg's results do not support the common view that voiced obstruents outrank voiceless obstruents in sonority. The reason for this is that obstruent clusters show a strong tendency to share all laryngeal features, including voicing. We see, then, that the principles developed here account correctly for the cross-linguistic generalizations noted by Greenberg, including certain ones (25) that did not follow from earlier versions of sonority theory. In addition, they make further predictions regarding preferred segment order that cannot be directly confirmed on the basis of Greenberg's study (which did not attempt to evaluate all possible orderings of the segment classes O, N, L, G, V), and which must be the subject of future research.
17.6.2 The Maximal Onset Principle
This approach also allows us to derive a further generalization, the Maximal Onset Principle. In section 17.5.1, the Maximal Onset Principle was stipulated as part of the statement of the Core Syllabification Principle, by giving statement (13b) precedence over (13c). We observed that this order of precedence was in accordance with the properties of the sonority cycle, but we were unable at that point to derive it from any higher-level principle. We are now in a position to see that at least part of this principle (and indeed its most prototypical case) follows in a straightforward manner from the Dispersion Principle: VCV is preferably syllabified V-CV, not VC-V, since V is a simpler final demisyllable than VC, and CV is a simpler initial demisyllable than V (see Table 17.2). This account extends to VCCV sequences as well. For example, the preference for the syllabification V-OLV instead of VO-LV owes to the fact that V is a simpler final demisyllable than VO, as just noted. Thus V-OLV is a simpler sequence than VO-LV by virtue of the cross-linguistic preference for open syllables, formally expressed by the statement in Table 17.2(b). (Note that
in the two syllabifications under comparison, OLV cannot be ranked with respect to LV since the Complexity Metric (20) does not compare demisyllables of different length.) This account extends to VCCV sequences of all types, and predicts that the syllabification V-CCV will be preferred to VC-CV just in case CCV is an admissible core demisyllable type in the language in question. Given these results, it is no longer necessary to give (13b) explicit precedence over (13c). The CSP can be restated as in (27):
Core Syllabification Principle (revised):
a. Associate each [+syllabic] segment to a syllable node.
b. Given P (an unsyllabified segment) adjacent to Q (a syllabified segment), if P is lower in sonority rank than Q, adjoin it to the syllable containing Q (iterative).
We may allow syllabification to take place simultaneously rather than directionally, with the Complexity Metric deciding between otherwise well-formed alternatives. Significantly, the present theory differs from standard versions of the Maximal Onset Principle in being defined in terms of the universal sonority scale rather than in terms of language-particular syllabification rules. This means that in cases of alternative syllabifications, the correct one will normally conform to the universal sonority scale. Again, evidence bearing on this claim is hard to find, but some recent studies suggest that it may be correct. Hayes (1985) finds that intervocalic sequences of [s] + oral stop in English tend to syllabify as VC-CV in spite of the fact that we find [s] + oral stop clusters word-initially (see, however, Davidsen-Nielsen 1974 for contrary results), and Lowenstamm (1981) reports similar observations for French.
17.6.3 Minimal distance constraints
A further result concerns what we may term "minimal distance constraints." It has been noticed by a number of linguists that not all syllables that are well-formed in terms of the Sonority Sequencing Principle actually constitute well-formed syllables in given languages. Some languages show a strong preference for syllables in which adjacent elements are not too close to each other in sonority rank. For example, Harris (1983) observes that in Spanish, initial clusters of the form ON and NL are systematically excluded, while those of the form OL are allowed. He suggests that this is not an arbitrary property of Spanish, but reflects a tendency for languages to prefer syllables in which adjacent elements are separated by a specifiable minimal distance on the sonority scale. As he further points out, if the sonority scale for Spanish is taken to be O < N < L < G < V, then we may say that Spanish requires adjacent consonants in the same syllable to be nonadjacent, i.e. to observe a minimal distance of 2 on the sonority scale. To the extent that statements of this sort prove to be simple and uniform across languages, they can be taken as providing further confirmation for the essential
correctness of sonority theory, without which these relations cannot be easily expressed. However, a number of observations suggest that this principle needs some qualification. First, minimal distance constraints only seem to apply in the initial demisyllable; they typically do not govern final demisyllables, where segments tend to be close to each other in sonority, as we have seen. Furthermore, to the extent that it has been applied to a wider set of languages, this principle turns out to require increasingly idiosyncratic, language-particular versions of the sonority hierarchy in order to be made to work (see Steriade 1982; Harris 1983; Selkirk 1984; van der Hulst 1984; and Borowsky 1986, for discussion of minimal distance constraints in a variety of languages involving several different sonority scales). While the notion that segments within the syllable should not be too similar in terms of their sonority rank undoubtedly offers valid insights into syllable structure, its formalization in terms of minimal distance constraints may not be the most satisfactory way of capturing these intuitions. The approach given here derives the main effects of minimal distance constraints without raising these problems. To see this, let us consider an example. Spanish, as noted, requires that consonants in initial clusters observe a minimal distance of 2 on the sonority scale. In a theory formally incorporating minimal distance constraints, such statements are part of the grammar, and the minimal distance value governing syllabification in each language must correspondingly be discovered by each language learner. Under such an account, the simplest possible language would be one with no minimal distance constraints at all. But this seems incorrect: minimal distance constraints appear to be quite widely observed across languages, and seem to represent the unmarked option. If this is so, we would prefer an account in which such constraints need not be stated explicitly in the grammar, but can be derived from independent principles. Under the present theory, such an account is possible. We may describe a language such as Spanish by saying that its initial CCV demisyllables have a maximal complexity of 1. Thus the only permitted initial demisyllables are of the form OLV (see Table 17.1). If we assume that 1 is the default value in universal grammar, this value does not have to be learned. We can account for more complex cases (those of languages which tend not to observe minimal distance constraints) by assuming that the learner abandons the default hypothesis only in the face of clear evidence to the contrary. For example, if a language allows ONV or OGV demisyllables in addition to OLV demisyllables, the value of the most complex demisyllable rises from 1 to 2, and the learner abandons the null hypothesis. This result follows as a consequence of the principles given earlier, and provides a straightforward account of the valid empirical core of the notion of "minimal distance," while accounting for the skewing between initial and final demisyllables.24 A further result is that by stating the Dispersion Principle over demisyllables rather than over consonant clusters (as in earlier approaches), we are
able to bring the syllable peak into the domain of our statements, and account for the general preference for LV demisyllables over GV demisyllables.25
17.6.4 The Syllable Contact Law
The theory presented above not only derives the effects of the Sonority Sequencing Principle intrasyllabically, it also derives the Syllable Contact Law transsyllabically. This principle, it will be recalled, holds that the preferred contact between two consecutive syllables is one in which the end of the first syllable is higher in sonority than the beginning of the second. In an extended version of this principle, Murray and Vennemann (1983) propose that the optimality of two adjacent, heterosyllabic segments increases in proportion to the extent that the first outranks the second in sonority. In this view, a sequence such as am.la, for example, constitutes a lesser violation than a sequence such as at.ya. Their version of the principle is paraphrased in (28): (28)
The Extended Syllable Contact Law (after Murray and Vennemann 1983, 520): The preference for a syllabic structure A$B, where A and B are segments and a and b are the sonority values of A and B respectively, increases with the value of a minus b.
This statement extends the Syllable Contact Law to syllable contacts of all types, including V$C. The consequence is that sequences like at.a exemplify the worst possible syllable contact and a.ta the best. This fully general version of the principle gives us the following implicational ranking of syllable contacts, in which the contact types improve as we proceed upward and rightward across the table:
(29)
      V     G     L     N     O
V    V.V   V.G   V.L   V.N   V.O
G    G.V   G.G   G.L   G.N   G.O
L    L.V   L.G   L.L   L.N   L.O
N    N.V   N.G   N.L   N.N   N.O
O    O.V   O.G   O.L   O.N   O.O
In the present theory, neither the Syllable Contact Law nor the Extended Syllable Contact Law need be stated separately, but follow from the principle of the Sonority Cycle as characterized in the earlier discussion. Suppose we view the complexity of any given syllable contact as a linear function of the complexity of each of its component demisyllables, taken individually. The ranking in (29) then follows straightforwardly from the complexity metric for individual demisyllables proposed in tables 17.1-17.4. To see this, let us assign an aggregate complexity score to each of the contact types in (29) calculated as a sum of the complexity values of each of the demisyllables that constitute it. We need consider only sequences in which neither demisyllable has more than two members, since this is the prototypical case. Thus the contact N.G (representing the demisyllable sequence VN.GV) is assigned a
score of 7, since the first demisyllable has a complexity value C of 3 and the second a complexity value C of 4 (see table 17.1). Proceeding in this way, we may construct a matrix from the table given in (29) by entering the appropriate scores for each contact type. We see that the optimality of a given contact type is a simple function of its aggregate complexity:
(30)
      V   G   L   N   O
V     5   4   3   2   1
G     6   5   4   3   2
L     7   6   5   4   3
N     8   7   6   5   4
O     9   8   7   6   5
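The matrix in (30) can be regenerated mechanically from the two-member demisyllable ranks of tables 17.1 and 17.2; the following sketch is mine, not part of the original text.

```python
# Regenerate matrix (30): score(X.Y) = C(final demisyllable VX) + C(initial demisyllable YV),
# with the C values read off tables 17.1 and 17.2.  Illustrative only.
FINAL_C = {'V': 0, 'G': 1, 'L': 2, 'N': 3, 'O': 4}    # C of final V, VG, VL, VN, VO
INITIAL_C = {'V': 5, 'G': 4, 'L': 3, 'N': 2, 'O': 1}  # C of initial V, GV, LV, NV, OV

order = ['V', 'G', 'L', 'N', 'O']
print('   ' + '  '.join(order))
for coda in order:                        # row label = final segment of the first syllable
    row = [FINAL_C[coda] + INITIAL_C[onset] for onset in order]
    print(coda + '  ' + '  '.join(str(v) for v in row))
# The N.G cell, for instance, comes out as 3 + 4 = 7, as in the text above.
```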
17.6.5 Core syllable typology
In Clements and Keyser (1983), it was pointed out that the inventory of core syllable types is subject to certain widely observed constraints. The following types of core syllable inventories are commonly found across languages (where each C and V can represent a potential cluster): (31)
Type I: CV
Type II: CV, V
Type III: CV, CVC
Type IV: CV, V, CVC, VC
On the other hand, other logically possible types of core syllable inventories are rare or lacking: (32)
a. V, VC
b. CVC, VC
c. CV, V, VC
d. CV, CVC, VC
e. CV, V, CVC
f. CV, VC
g. V, CVC
h. V, VC, CVC
Thus we find many languages whose core syllable types fall into the set of categories in (31), but few or none whose core syllable types correspond to those in (32). Clements and Keyser point out that the attested sets in (31) are characterized by two logical implications: a closed syllable type implies an open syllable type, and a vowel-initial syllable type implies a consonant-initial type. The CV syllable type is universal, as it is implied by all the others. These relations follow from the principles of markedness presented above. By the Complexity Hierarchy stated in (21a), closed syllables imply open syllables because final VC demisyllables imply final V demisyllables; and similarly, V-initial syllables imply C-initial syllables because V-initial demisyllables imply CV-initial
demisyllables. Skewed inventories such as the one in (32c) are excluded by the fact that core syllabification rules are defined on demisyllables, not syllables: thus the presence of a CV initial demisyllable and a VC final demisyllable in the core phonology is sufficient to determine the presence of a CVC core syllable type.
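On this view the core syllable inventory is just the free combination of whatever initial and final demisyllables are present at L(IS); the sketch below is mine, in CV-skeleton shorthand.

```python
# Core syllable types as the cross-product of initial and final demisyllables at L(IS).
# Illustrative only; demisyllables are given as CV-skeleton strings sharing the peak V.
def syllable_types(initial_demis, final_demis):
    return sorted(ini + fin[1:] for ini in initial_demis for fin in final_demis)

# A language with initial {CV, V} and final {V, VC} demisyllables yields Type IV of (31):
print(syllable_types({'CV', 'V'}, {'V', 'VC'}))   # ['CV', 'CVC', 'V', 'VC']
# A skewed inventory like (32c) -- CV, V, VC but no CVC -- can never be generated this way.
```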
17.7 Residual problems
This section examines two residual problems: the treatment of geminates and other linked sequences, and the place of the major class features in the feature hierarchy.
17.7.1 The special status of linked sequences
We have said nothing as yet about a significant set of exceptions to the principles of syllable contact discussed in section 17.6.4. Many languages allow just a small set of intervocalic consonant clusters, typically including geminates and homorganic NC (nasal + consonant) clusters. Indeed, some languages, including Japanese, Southern Paiute, and Luganda, allow only these. As Prince points out (1984: 242-243), the generalization seems to be that languages otherwise eschewing heterosyllabic consonant clusters may allow them just in case they involve linked sequences: sequences sharing a single set of features. More precisely, what seems to be required is that the adjacent consonants share the place of articulation node. This is exactly what geminates and homorganic NC clusters have in common, as is shown by the following, simplified diagrams of the sequences [tt] and [nt], respectively (for this notation see Clements 1985): (33)
[Diagrams (33a) and (33b): feature-geometric representations of [tt] and [nt], with the tiers root, supralaryngeal, place, and [coronal]; in each, the two consonants are linked to a single place node.]
Intuitively, what makes these sequences simple is the fact that they involve only a single specification for place of articulation; indeed there is some evidence that the NC clusters may be gesturally equivalent to a single consonant at the same place of articulation in some languages (see Browman and Goldstein 1986 for English and Chaga, but cf. Fujimura and Lovins 1978, note 43 for contrary results for English). We conclude from these observations that intersyllabic articulations involving a single place specification are simpler than those involving two (or more) place specifications. This principle must clearly take precedence over the sonority principles stated earlier. (On the other hand, sonority considerations may help to explain the fact that the C in NC clusters is almost always an obstruent.) Why are geminates heterosyllabic instead of tautosyllabic ? In other words, why
is a word like totta universally syllabified tot-ta rather than to-tta? The answer to this lies in the CSP. As it scans the skeletal tier, the CSP syllabifies leftward as far as possible, first adjoining the second half of the geminate with the final vowel. It cannot syllabify the first half of the geminate with that vowel, since both skeletal C-elements dominate a single segment and thus have the same sonority rank. Consequently the first half of the geminate syllabifies with the preceding vowel. That it syllabifies into the preceding demisyllable at all, rather than e.g. being deleted due to its low sonority rank, reflects a general principle, overriding sonority considerations, to the effect that linked material is syllabified whenever possible (Christdas 1988).26
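The scanning behavior just described can be mimicked with a small sketch. This is my own illustration rather than the paper's formal statement of the CSP: it uses equality of sonority rank as a stand-in for the shared melody of a true geminate, and the sonority ranks themselves are schematic.

    # Sketch (an illustration, not the formal CSP): onsets are adjoined leftward
    # from each peak only under a strict rise in sonority, so the two halves of a
    # geminate (equal in sonority, since they dominate a single segment) can never
    # both join the onset; the leftover half closes the preceding syllable.

    SONORITY = {"t": 1, "n": 2, "l": 3, "w": 4, "o": 5, "a": 5}   # schematic ranks

    def core_syllabify(segments):
        """Right-to-left syllabification; assumes at least one vowel per word.
        A geminate is represented here simply as the same segment occurring twice."""
        syllables = []
        i = len(segments)
        while i > 0:
            peak = max(j for j in range(i) if SONORITY[segments[j]] == 5)
            onset = peak
            # adjoin onset material only under a strict rise in sonority toward the peak
            while onset > 0 and SONORITY[segments[onset - 1]] < SONORITY[segments[onset]]:
                onset -= 1
            # consonants that failed to adjoin end up as the coda of this syllable
            syllables.append("".join(segments[onset:i]))
            i = onset
        return ".".join(reversed(syllables))

    print(core_syllabify(list("totta")))   # -> tot.ta, not to.tta
    print(core_syllabify(list("tota")))    # -> to.ta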
17.7.2
The status of the major class features in the feature hierarchy
A further question concerns the status of the major class features in the feature hierarchy. In the view presented in Clements (1985), major class features are placed under the domination of the supralaryngeal node. By assigning the major class features to the supralaryngeal node rather than to the root node, we predict that laryngeal "glides" - segments which have only laryngeal specifications - are not ranked in any position on the sonority scale, and are not characterized for any major class features. This seems correct from a cross-linguistic perspective. Laryngeals tend to behave arbitrarily in terms of the way they class with other sounds, avoiding positions in syllable structure that are available to true glides, and patterning now with obstruents, now with sonorants in a way often better explained by their historical origin in any given language than by their inherent phonological properties. In assigning these features to separate tiers, however, we predict that they should be able to engage in assimilatory spreading. As pointed out independently by Schein and Steriade (1986) and Bruce Hayes (personal communication), the support for this prediction is at present quite thin. An alternative view is that they have the status of "annotations" on the supralaryngeal class node, in the sense that they are features characterizing this node, but which are not arrayed on separate tiers. This assumption would entail that major class features spread if and only if the supralaryngeal node spreads.27
17.8 General discussion
Let us review the answers we have proposed to some of the questions raised at the outset of this study:
1. How is sonority defined in phonological theory? Is it a primitive, or is it defined in
terms of other, more basic features? We have proposed that sonority is not a primitive phonological feature, but a derived phonological property of representations definable in terms of the categories [syllabic, vocoid, approximant, sonorant].
2. What are its phonetic properties? We have suggested that the phonetic correlates of sonority are just those of the major class features which define it, which share a "family resemblance" in the sense that all of them contribute to the overall perceptibility of the classes of sounds they characterize.
3. At what linguistic level do sonority sequencing constraints hold? We have proposed that the constraints are primarily defined in core phonology (more specifically, at the level of initial syllabification (IS)), where syllabification obeys the Complexity/Length Hierarchy. Later rules, especially those applying at the periphery of the syllabification domain, may introduce new, more complex syllable types which create surface exceptions to the sequencing constraints.
4. Over what units are sonority constraints defined? It has been shown that sonority constraints are defined over demisyllables, rather than over syllables or other subsyllabic units. The demisyllable is necessary and sufficient to the statement of sonority constraints.
5. Can languages vary in their choice of sonority scales? We have argued that a single sonority scale, that given in (6)-(7) or a simple variant of it, characterizes sonority in all languages. Apparent language-particular variation may reflect the effect of the Sequential Markedness Principle, which holds that if only a subset of a particular class is allowed in some position, it will be the least marked subset.
6. Can syllable types be ranked along a scale of complexity? It has been argued that demisyllables (and derivatively, syllables) can be ranked along a scale of complexity according to the principle of the Sonority Cycle, which holds that the preferred syllable shows a sharp rise in sonority followed by a gradual fall. This principle is supported by the range of evidence discussed in section 17.6.
This approach is further supported by the simplification it allows in the description of particular languages. The core syllable inventory of a given language is largely determined by a small number of variables:
(34)
i. the domain of core syllabification (nonderived stem, derived stem at level n, word, etc.);
ii. type of permitted syllable peaks (V, L, N, ...);
iii. maximum length of each demisyllable type;
iv. maximum degree of complexity of each type of demisyllable (predicts the presence of all the less complex demisyllables of that type).
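The variables in (34) amount to a small per-language parameter bundle. The record below is a schematic rendering with hypothetical placeholder values; it is not drawn from the description of any particular language.

    # Schematic only: the four variables of (34) as a per-language record.
    # The example values are hypothetical placeholders.

    from dataclasses import dataclass
    from typing import Dict, Tuple

    @dataclass
    class CoreSyllabificationSettings:
        domain: str                        # (i)   e.g. "word" or "nonderived stem"
        permitted_peaks: Tuple[str, ...]   # (ii)  e.g. ("V",) or ("V", "L", "N")
        max_length: Dict[str, int]         # (iii) maximum length per demisyllable type
        max_complexity: Dict[str, int]     # (iv)  maximum complexity per demisyllable type;
                                           #       all less complex demisyllables are predicted

    hypothetical = CoreSyllabificationSettings(
        domain="word",
        permitted_peaks=("V",),
        max_length={"initial": 2, "final": 2},
        max_complexity={"initial": 2, "final": 2},
    )
    print(hypothetical)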
In addition, languages may have core syllabification rules defining well-formed "marked" demisyllable types; filters specifying systematic gaps in the set of well-formed demisyllables; and perhaps rules defining the occurrence of permissible extrasyllabic elements ("appendices", "affixes") in domain-peripheral position. It is likely, however, that such rules play a less important role in core syllabification than has previously been thought. These results have consequences for questions regarding the formalization of syllable representation. There are many current views concerning the nature of
subsyllabic constituency, and the nature of the evidence supporting one or another of these views is not always as clear or straightforward as we might like. In contrast to some previous studies of sonority-based distributional constraints, the present theory claims that sonority contours are evaluated over the domain of the demisyllable, rather than that of the onset and rhyme. Indeed, the results summarized in this paper can be obtained only if we take the demisyllable as the domain of sonority constraints, since most of them make crucial reference to CV subsequences. For example, the Maximal Onset Principle requires that VCV be syllabified preferentially as V-CV rather than as VC-V, since V is a simpler final demisyllable than VC and CV is a simpler initial demisyllable than V (for any value of C). Second, we must take the initial demisyllable (rather than the onset cluster) as the basis of sonority if we are to express the preference rankings for different types of CV demisyllables stated in tables 17.1 (a)/17.2 (a) and for the different types of CCV demisyllables stated in table 17.1 (b), and hence express preferences for initial demisyllables with full generality. Third, our ability to derive prototypical cases of the Syllable Contact Law requires that sonority constraints be stated on the initial demisyllable rather than the onset cluster, since the calculation of aggregate complexity scores of VC-CV contact types depends crucially on the complexity rankings of the component VC and CV demisyllables. Finally, the theory's prediction of the implications of Core Syllable Typology also makes crucial reference to the initial demisyllable, since these implications follow from the relative complexity of CV, V, VC demisyllables. A further claim of this theory is that sonority-based dependencies should not hold between different demisyllables. Thus (without further qualification of the theory) we would not anticipate finding sonority-based dependencies holding between initial and final demisyllables, such that an initial demisyllable having a sonority profile of type A fails to combine with a final demisyllable whose sonority profile is of type B. Nor should we expect to find sonority-based dependencies holding across syllable boundaries. These predictions are correct, as far as I know. Thus, for example, we have found that apparent "syllable contact" dependencies are derivable from an independently-motivated metric needed to express the relative complexity of individual demisyllables, and require no separate
17.9
Conclusion
The notion of the Sonority Cycle as developed above provides us with a basis for explaining the striking and significant regularities in syllable structure that we find across languages, and for integrating these observations into a formal theory of syllable representation, allowing us to capture many generalizations that have up to now been inadequately understood or explained. Our results suggest that a significant cross-linguistic regularity of phonological
structure (the Sonority Cycle) may be most clearly revealed at a level of representation considerably removed from surface representation (or acoustic reality), but that this principle has a regular expression at the phonetic level through the mediation of the major class features which provide its vocabulary. Such a conclusion should not be surprising in view of the fact that what is perceptually real for native speakers may differ in significant ways from the speech signal itself; indeed this has been the lesson of phonological studies since the emergence of the modern concept of the phoneme in the work of Sapir, Trubetzkoy, Jakobson and others in the early 1930s. This result by no means implies a divorce between linguistics and phonetics, but rather takes us a step further toward solving the long-standing enigma of how abstract linguistic form is communicated through the medium of the speech waveform: significant patterning relations may be encoded at a certain degree of abstraction from the physical data, but must have a regular manifestation in the speech signal if they are to be successfully conveyed from speaker to hearer.
Notes I would like to thank Harry van der Hulst, John McCarthy, Stuart Milliken, Donca Steriade, and participants in a seminar on syllable phonology given on three occasions during 1985-1986 at Cornell University, the University of Washington, and the Summer Linguistics Institute at the University of Salzburg, for their valuable critical reactions to various presentations of the ideas in this paper. I am further grateful to Annie Rialland for discussion of the French data, and to Mary Beckman, Osamu Fujimura, and John Kingston for their written commentary on earlier drafts. All of these have contributed in some way to improvements in style and substance, although they do not necessarily agree with its conclusions. Earlier versions of this paper were presented at Yale University in November, 1985, at the Annual Meeting of the Linguistic Society of America in Seattle, Washington, in December, 1985 and at the Workshop on Features, Wassenaar, The Netherlands in June, 1986. 1 Cross-linguistic generalizations such as these, at varying degrees of abstraction from the primary data, provide the explicanda of theory construction in linguistics, and form the basis of the hypotheses and models that eventually come to constitute a formal linguistic theory, i.e. a theory of possible grammars and optimal grammars. 2 Sievers distinguished between the Drucksilbe, conceived of as an articulatory-defined syllable produced with a single independent expiratory pulse, and the Schallsilbe, an auditorily-defined syllable determined by the relative audibility or sonority (Schallfulle) of its members. These two criteria do not always coincide, as is evidenced by the German (or English) word Hammer which constitutes one Drucksilbe but two Schallsilben. Of these, the Schallsilbe is most relevant to sonority theory. See Bloomfield (1914) for a brief summary of Sievers' ideas, in which the two syllable types are termed "natural syllable" and "stress syllable." 3 Jespersen's scale differed from Sievers' in ranking all voiceless sounds before all voiced, in not attributing a separate rank to voiceless stops and fricatives, and in assigning nasals and laterals the same rank. 4 The version of the Sonority Sequencing Principle given here follows Sievers rather than Jespersen, as this is the version that is most widely followed today, cf. e.g. Kiparsky 325
5 6 7
8
9
10
11
12
(1979) and Lowenstamm (1981). Jespersen allowed elements of equal sonority to be adjacent within the syllable. His reluctance to adopt the more restrictive version may have been motivated by the common occurrence of initial clusters like st and final clusters like ts, which constitute anomalies under Sievers' formulation, but not under Jespersen's where s and / are of equal rank and may thus occur adjacent to each other. See Pike (1942: 137-148) for a presentation of this notion, as well as Catford (1977) for more recent discussion. There has been relatively little critical discussion of the notion of sonority in the recent literature; a notable exception is Bell and Saka (1983). Early proponents of the theory, such as Sievers and Jespersen, did not distinguish between underlying and surface representation, and consequently assumed a surfaceoriented version of the principle. Discussion in the context of generative phonology has generally recognized that the SSP interacts with other rules and principles which may give rise to surface-level exceptions. For example, Kiparsky (1979, 1981) notes that the SSP may be overridden by language-particular rules, while Fujimura and Lovins (1979) allow exceptions within syllable "affixes" that lie outside the "core." This statement must be qualified by the observation that the identity of the sonority scale varies in detail from one linguist to another. What is a sonority reversal for one writer may be a sonority plateau for another, and what is a sonority plateau for one may constitute an ascending or descending ramp for another. This qualification extends to the further discussion below. Note also that the cases in (5a) represent violations of Sievers' version of the Sonority Sequencing Principle as given in (2), but not of Jespersen's, which tolerates clusters of equal sonority within the syllable. Data sources for the less familiar languages are as follows: Mohawk (Michelson 1988), Cambodian (Huffman 1972), Marshallese (Bender 1976), Ewe (author's field notes, standard dictionaries), Pashto (Bell and Saka 1983), Klamath (Barker 1963), Ladakhi (Koshal 1979), Kota (Emenau 1944), Abaza (Allen 1956), Tocharian A (Coppieters 1975; J. Jasanoff, p.c), Yatee Zapotec (Jaeger and Van Valin 1982), Turkish (Clements and Keyser 1983), Berber (Dell and Elmedlaoui 1985), Luganda (Tucker 1962), Bella Coola (Nater 1984). A few representative references follow: Allen (1956) (Abaza), Dell and Elmedlaoui (1985) (Berber), Huffman (1972) (Cambodian), Nater (1984) (Bella Coola). See also Bell and Saka (1983) for a detailed examination of Pashto. (Notice that while Dell and Elmedlaoui argue that Berber largely conforms to sonority sequencing restrictions, they also recognize language-particular configurations in which these requirements are suspended.) Heffner (1950:74) states that "sonority may be equated more or less correctly with acoustic energy and its quantities determined accurately by electronic means," citing Fletcher (1929) in support. It is true that Fletcher's methods of measuring the "phonetic power" of segments give us a ranking grossly similar to familiar sonority scales, with vowels at one end and obstruents at the other. But Fletcher's results do not support the finer distinctions usually thought to be required for linguistic purposes. 
Thus by one of his measures (the "threshold" method), the nonanterior sibilants represented by orthographic ch, sh ranked higher in power (roughly equivalent to sonority) than nasals and all other obstruents, and the voiceless stop [k] ranked higher than fricatives or voiced stops. Moreover, Fletcher observed a high degree of interspeaker variation, suggesting that crucial details of such phonetic measures might vary substantially from speaker to speaker. This definition, which follows Catford (1977: 119-127), includes voiceless sonorants, 326
13
14
15
16 17 18
which are normally produced with audible turbulence. Unlike Catford, however, I consider all vowels to be approximants. The sonority ranking of voiceless approximants is not well established, and requires further examination. The term " approximant" was first introduced in Ladefoged (1964), and replaces the older term " frictionless continuant." Bell notes: "among the languages with only syllabic nasals, very few are subject to vowel reduction; of those with syllabic liquids, all but a handful do have some form of vowel reduction. The formation of syllabic liquids may be strongly disfavored where nonreduced vowel syncope is the process of origin, but not disfavored under reducedvowel syncope" (171). The CSP differs from a similar algorithm given for English syllabification by Kahn (1980) in being universal rather than language-particular. It does not syllabify in terms of language-particular initial and final clusters, as does Kahn's rule, but in terms of the universal sonority scale. In this view, language-particular differences in core syllabification are attributed to further parameters of core syllabification, such as length constraints of the sort just mentioned, or to further rules of core syllabification that apply independently of sonority restrictions, as discussed below. Versions of the Maximal Onset Principle were known to the ancient Sanskrit and Greek grammarians (Varma 1929; Allen 1951). It is usually considered to have had exceptions in Indo-European, however; see Hermann (1923), Borgstrom (1937), Schwyzer (1939), and Lejeune (1972) for relevant discussion. This assumes either simultaneous or right-to-left application of the core syllabification rules. As we will see below, our final statement of the Core Syllabification Principle in (27) will be consistent with both of these modes of application. This skewing may explain the asymmetries between initial and final clusters noted by Reilly (1986). The term demisyliable as used here is inspired by Fujimura's account, but differs from it in significant respects. Fujimura has used it to designate a phonetic sequence used for the purposes of speech synthesis and automatic speech recognition (Fujimura et al. 1977), and has characterized it as follows: We have tentatively decided on an operational rule for " cutting " each syllable in two, producing initial and final demisyllables...The cutting rule may be stated: "Cut 60 msec after release, or if there is no release, 60 msec after the onset of the vocalic resonance." This is usually a point shortly after the beginning of the so-called steady state of the vowel, that is, after the consonant-vowel transition.
In this usage, the demisyllable is an acoustic unit. Fujimura also conceives of the demisyllable as a phonological unit, one of the two halves into which syllable "cores" are divided (Fujimura and Lovins 1977; Fujimura 1979, 1981). This unit has not previously been used in the statement of phonological rules and constraints to my knowledge, although it has been identified with the onset/rhyme distinction inside the syllable core (Fujimura 1981: 79). In my usage, for reasons to be made clear, demisyllables are not identified with onsets and rhymes; see especially section 17.8 for discussion.
19 I assume that in languages without diphthongs, long vowels are represented as VV, while in languages with (falling) diphthongs, long vowels (and falling diphthongs) are represented as VC. It follows from this and from the definition in (15) that in syllables containing long vowels V1V2, the first demisyllable ends in V1 and the second begins with V2, while in syllables containing VC diphthongs the first ends in V and the second
begins with the same V. In languages whose long syllable nuclei are characteristically nondiphthongal and therefore of the type VV, the distribution of long vowels tends to be equivalent to that of short vowels (see Vago 1985 for Hungarian). In contrast, in languages having diphthongs and long vowels of the type VC, such as German and English, the distribution of long vowels tends to be equivalent to that of short vowels followed by consonants (Moulton 1956; Selkirk 1982: 351).
20 Notice further that by the complexity metric (20a), OL (D = 0.11) is ranked as more complex than OV (D = 0.06), ON (D = 0.25) as more complex than OL, and so forth. Thus, (20) predicts that syllable peaks increase in complexity as they decrease in sonority. As noted earlier, this is not quite correct, as syllables with syllabic nasals have been more frequently reported across languages than syllables with syllabic liquids. It remains to be seen whether this unexpected reversal reflects the relative complexity of syllabic nasals and liquids, or some other factor.
21 There are no exceptions to this statement in morpheme-final position. Morpheme-internally, the only common exceptions are napkin, pumpkin, breakfast, magpie, tadpole,
aardvark, Afghanistan, and frankfurter. Proper names show frequent violations but may usually be analyzed into a stem and name-forming suffix, as in Bradford/Bedford, Cambridge/Sturbridge, Lindberg/Sandberg, Bradbury/Woodbury, Tompkins/Watkins,
Hatfield/Westfield. 22 A similar account of the exceptional status of coronals has been proposed by Devine and Stevens (1977) in the context of their discussion of Latin syllabification (I thank John McCarthy for calling this work to may attention). There are rarer cases of languages exhibiting a preference for wowcoronals in certain positions, for which an alternative explanation will be required. One such case involves the occurrence of clusters like kt, pt, mn in Attic Greek to the exclusion of clusters like tk, tp, nm; however, Steriade (1982: ch. 4) argues that the initial members of such clusters are extrasyllabic throughout the lexical phonology. 23 A few qualifications are in order. First, Greenberg's generalizations concerned initial and final position in the word, not the syllable, and therefore do not necessarily translate directly into syllable structure. We have already noted that initial and final clusters in the syllabification domain (typically, the word) often deviate somewhat from initial and final clusters in internal syllables, especially in permitting extrasyllabic sequences or "appendices." As such sequences often reflect the operation of syllabification rules that override the usual sonority constraints, we would expect Greenberg's data to be less supportive of the theory developed here than generalizations based exclusively on syllabification data. Second, Greenberg's survey was based on a study of the descriptive literature, and inherits the analytical weaknesses and inadequacies of its sources. As Greenberg notes, several arbitrary choices had to be made, particularly concerning the decision whether to regard stop-fricative sequences as clusters or affricates. Third, Greenberg's implicational universals are probably best regarded as statistical rather than categorical in nature. Several implications that were true of the sample have since proven to have exceptions in other languages: thus, Ladakhi has LOV syllables but not OLV syllables (Koshal 1979), and Yatee Zapotec has the rare GOV syllable type, as noted in section 3.1. The counterpart to this is that many statements that were not categorically true of Greenberg's sample may turn out to be significant when a wider sample of languages is considered. 24 These results do not depend on the identity of the sonority scale we choose; more complex scales recognizing a larger number of points will yield the same relationship between minimal distance and degree of complexity. For example, given the hypothetical seven-point sonority scale O < Z < N < L < R < G < V , the most
25
26
27 28
equally distributed three-member demisyllable will be OLV. As we successively minimize the difference between the medial member of the demisyllable and either endpoint we increase the value for D and thus increase the complexity value C. For example, OLV has the value 0.25 for D, ONV has the value 0.34, and OZV has the value 1.07. There is a further difference between this account and accounts making use of the notion "minimal distance". Given the sonority scale O < N < L < G < V, our account predicts that we might find languages containing only one of two demisyllables of a given degree of complexity. This is because the Complexity/Length Hierarchy (21) only requires that given the presence of demisyllables of some degree of complexity n, demisyllables with lower degrees of complexity must also be present. For example, we should find languages with initial demisyllables of the form OGV but not ONV (or vice versa), both of which have a complexity rank of 2. A theory in which ONV is excluded by a minimal distance constraint would necessarily exclude OGV at the same time.
Homorganic NC sequences are tautosyllabic in many languages, such as Bantu which allows NCV syllables both initially and word-internally. In these cases it is often plausible to analyze the NC sequence as a single prenasalized stop (Clements 1986), so that the demisyllable type is actually CV.
See, however, Milliken (1988) for an account of Flap Formation in English (and similar rules in other languages) in terms of the spreading of subsets of major class features.
In some languages, however, we find constraints holding across pairs of syllables such that the sonority rank of the onset of the second syllable must be equal to or greater than the sonority rank of the onset of the first. Williamson (1978), in her discussion of such a phenomenon in Proto-Ijo observes that it often arises historically through processes of consonant weakening in noninitial syllables, such as intervocalic voicing or spirantization. This phenomenon does not seem to reflect sonority considerations exclusively, since the dependencies in question often involve features such as voicing and continuance and may be equally well viewed as involving assimilation to the intervocalic context. Clearly, however, this is an important potential type of exception to our statement that deserves fuller and more systematic investigation.
References
Allen, W. S. 1951. Phonetics in Ancient India. London: Oxford University Press. 1956. Structure and system in the Abaza verbal complex. Transactions of the Philological Society. 164-169. Aronoff, M. and R. T. Oehrle, eds. 1984. Language Sound Structure: Studies in Phonology Dedicated to Morris Halle by his Teacher and Students. Cambridge, MA: MIT Press. Barker, M. A. R. 1963. Klamath Dictionary. University of California Publications in Linguistics 31. Berkeley: University of California Press. Basboll, H. 1977. The structure of the syllable and proposed hierarchy of phonological features. In W. U. Dressier et al. (eds.) Phonologica 1976. Innsbruck, 143-148. Bell, A. 1977. The Distributional Syllable. In A. Juilland (ed.) Linguistic Studies Offered to Joseph Greenberg. Saratoga: Anma Libri, 249-262. 1978. Syllabic consonants. In J. H. Greenberg (ed.), 153-201. Bell, A. and M. M. Saka. 1983. Reversed sonority in Pashto initial clusters, Journal of Phonetics 11: 259-275. Bender, B. W. 1976. Marshallese Reference Grammar. Hawaii: University of Hawaii Press. 329
Bloomfield, L. 1914. The Study of Language. New York: Henry Holt; new edition, Amsterdam: John Benjamins, 1983. Borgstrom, C. Hj. 1937. The Dialect of Barra in the Outer Hebrides. Norsk Tidsskrift for Sprogvidenskap 8: 71-242. Borowsky, T. J. 1986. Topics in the lexical phonology of English. Ph.D. dissertation, University of Massachusetts at Amherst. Browman, C. P. and L. Goldstein. 1986. Towards an articulatory phonology. Phonology Yearbook 3: 219-252. Catford, J. C. 1977. Fundamental Problems in Phonetics. Bloomington: Edinburgh and Indiana University Press. Chomsky, N. and M. Halle. 1968. The Sound Pattern of English, New York: Harper and Row. Christdas, P. 1988. Tamil phonology and morphology. Ph.D. dissertation, Cornell University. Clements, G. N. 1985. The geometry of phonological features. Phonology Yearbook 2: 223-252. 1986. Compensatory lengthening and consonant gemination in Luganda. In L. Wetzels and E. Sezer (eds.) Studies in Compensatory Lengthening. Dordrecht: Foris. 1987a. Phonological feature representation and the description of intrusive stops. In A. Bosch et al. (eds.) Papers from the Parasession on Autosegmental and Metrical Phonology. Chicago: Chicago Linguistic Society. 1987b. Substantive issues in feature specification, Talk given at MIT, Cambridge, MA. Clements, G. N. and S. J. Keyser. 1981. A three-tiered theory of the syllable. Occasional Paper no. 19, Center for Cognitive Science, MIT. Clements, G. N. and S. J. Keyser. 1983. CV Phonology: a Generative Theory of the Syllable. Linguistic Inquiry Monograph 9. Cambridge, MA: MIT Press. Coppieters, R. 1975. The Fremdvokal in Tocharian A. MS. Department of Modern Languages and Literatures, Pomona College, Claremont, CA. Cutting, J. 1975. Predicting initial cluster frequencies by phoneme difference. Haskins Laboratory Status Report on Speech Research, SR-42/3: 233-239. Davidsen-Neilsen, N. 1974. Syllabification in English words with medial sp, st, sk. Journal of Phonetics 2: 15-45. Dell, F. 1980. Generative Phonology and French Phonology. Cambridge: Cambridge University Press. [Original French edition: Les regies et les sons. Paris: Hermann, 1973.] Dell, F. and M. Elmedlaoui. 1985. Syllabic consonants and syllabification in Imdlawn Tashlhiyt Berber, Journal of African Languages and Linguistics 7: 105-130. Devine, A. M. and L. D. Stephens. 1977. Two Studies in Latin Phonology. Saratoga, CA: Anma Libri. Dewey, G. 1923. Relative Frequency of English Speech Sounds. Cambridge, MA: Harvard University Press. Emenau, M. B. 1944. Kota Texts, Part 1. University of California Publications in Linguistics 2:1. Berkeley: University of California Press. Fletcher, H. 1929. Speech and Hearing. New York: Van Nostrand. Foley, J. 1970. Phonological distinctive features. Folia Linguistica 4: 87-92. 1972. Rule precursors and phonological change by meta-rule. In Stockwell and Macaulay (eds). 1977. Foundations of Theoretical Phonology. Cambridge: Cambridge University Press. Fujimura, O. 1979. An analysis of English syllables as cores and affixes. Zeitschnft fur Phonetik, Sprachmissenschaft und Kommunikationsforschung 32: 471-6. 330
1981. Temporal organization of articulatory movements as a multidimensional phrasal structure. Phonetica 38: 66-83. Fujimura, O. and J. B. Lovins. 1977. Syllables as concatenative phonetic units. Distributed by the Indiana University Linguistics Club. Revised and abridged in A. Bell and J. B. Hooper (eds.) Syllables and Segments. Amsterdam: NorthHolland, 1978, 107-120. Fujimura, O., M. J. Macchi, and J. B. Lovins. 1977. Demisyllables and affixes for speech synthesis. 9th International Congress on Acoustics, Madrid; manuscript, A.T. & T. Bell Laboratories. Grammont, M. 1933. Traite de Phonetique. Paris: Librairie Delagrave. Greenberg, J. H. 1978. Some generalizations concerning initial and final consonant clusters. In J. H. Greenberg (ed.) 243-79; originally published in Russian in Voprosy jfazykoznanija 4: 41-65. 1964. Greenberg, J. H., ed. 1978. Universals of Human Languages, vol. 2: Phonology. Stanford, CA: Stanford University Press. Halle, M. and G. N. Clements. 1983. Problem Book in Phonology. Cambridge, MA: MIT Press/Bradford Books. Halle, M. and J.-R. Vergnaud. 1980. Three dimensional phonology. Journal of Linguistic Research 1: 83-105. Hankamer, J. and J. Aissen. 1974. The sonority hierarchy. In A. Bruck et al. (eds.) Papers from the Parasession on Natural Phonology. Chicago: Linguistic Society. Harris, J. 1983. Syllable Structure and Stress in Spanish: a Nonlinear Analysis. Linguistic Inquiry Monograph 8. Cambridge, MA: MIT Press. 1987. Santisima, la Nauyaca; Notes on Syllable Structure in Spanish. MS. Cambridge, MA: MIT. Hayes, B. 1985. A Metrical Theory of Stress Rules. New York: Garland Publishing. Heffner, R.-M. S. 1950. General Phonetics. Madison WI: University of Wisconsin Press. Hermann, E. 1923. Silbenbildung im Griechischen und in den anderen indogermanischen
Sprachen. Gottingen: Vandenhoek and Ruprecht. Hooper, J. B. 1972. The syllable in phonological theory. Language 48: 525-540. 1976. An Introduction to Natural Generative Phonology. New York: Academic Press. Huffman, F. E. 1972. The boundary between the monosyllable and the disyllable in Cambodian. Lingua 29: 54—66. Hulst, H. van der. 1984. Syllable Structure and Stress in Dutch. Dordrecht: Foris. Hulst, H. van der and N. Smith, eds. 1982. The Structure of Phonological Representations, Parts 1 and 2. Dordrecht: Foris. Hyman, L., A. Duranti, and M. Morolong. 1980. Towards a typology of the direct object in Bantu. In L. Bouquiaux (ed.) Uexpansion bantoue, vol. 2. Paris: Societe d'Etudes Linguistiques et Anthropologiques de France, 563-582. Jaeger, J. J. and Van Valin, Jr. 1982. Initial consonant clusters in Yatee Zapotec. IjfAL 48: 125-138. Jakobson, R., G. Fant, and M. Halle. 1952. Preliminaries to Speech Analysis: the Distinctive Features and their Correlates. Cambridge, MA: MIT Press. Janson, T. 1986. Cross-linguistic trends in the frequency of CV syllables. Phonology Yearbook 3: 179-195. Jespersen, O. 1904. Lehrbuch der Phonetik. Leipzig and Berlin. Kahn, D. 1980. Syllable-based Generalizations in English Phonology. New York: Garland Publications. Keating, P. 1983. Comments on the jaw and syllable structure. Journal of Phonetics 11: 401-406. 1985. The phonetics/phonology interface. UCLA Working Papers in Phonetics 62:1-13. 331
Kiparsky, P. 1979. Metrical structure assignment is cyclic in English. Linguistic Inquiry 3: 421—441. 1981. Remarks on the metrical structure of the syllable. In W. U. Dressier et al. (eds.) Phonologic a 1980 (Innsbrucker Beitrdge zur Sprachmissenschaft, vol. 36), Innsbruck. 1985. Some consequences of lexical phonology. Phonology Yearbook 2: 83-138. Koshal, S. 1979. Ladakhi Grammar. Delhi, Varanasi and Patna: Motilal Banarsidass. Ladefoged, P. 1964. A Phonetic Study of West African Languages. Cambridge: Cambridge University Press. 1982. A Course in Phonetics, 2nd edition. New York: Harcourt Brace Jovanovich. Laeufer, C. 1985. Some language-specific and universal aspects of syllable structure and syllabification: evidence from French and German. Ph.D. dissertation, Cornell University. Lekach, A. F. 1979. Phonological markedness and the sonority hierarchy. In K. Safir (ed.) Papers on Syllable Structure, Metrical Structure and Harmony Processes. MIT Working Papers in Linguistics, vol. 1. Lejeune, M. 1972. Phonetique historique du mycenien et du grec ancien. Paris: Klincksieck. Liljencrants, J. and B. Lindblom. 1972. Numerical simulation of vowel quality systems: the role of perceptual contrast. Language 48: 839-862. Lindblom, B. 1983. Economy of speech gestures. In P. F. MacNeilage (ed.) The Production of Speech. New York: Springer-Verlag, 217-245. Lowenstamm, Jean. 1981. On the maximal cluster approach to syllable structure. Linguistic Inquiry 12: 575-604. Michelson, K. 1988. A Comparative Study on Lake-Iroquoian Accent. New York: Kluwer Academic Publishers. Milliken, S. 1988. Protosyllables: syllable and morpheme structure in the lexical phonology of English. Ph.D. dissertation, Cornell University. Moulton, W. 1956. Syllable nuclei and final consonant clusters in German. In M. Halle et al. (eds.) For Roman jfakobson. The Hague: Mouton. Murray, R. W. and T. Vennemann. 1983. Sound change and syllable structure in Germanic phonology. Language 59: 514—528. Nater, H. F. 1984. The Bella Coola Language. Canadian Ethnology Service Mercury Series paper no. 92. National Museums of Canada, Ottawa. Noske, R. 1982. Syllabification and syllable changing rules in French. In van der Hulst and Smith (eds.), part 2, 257-310. Ohala, J. and H. Kawasaki. 1984. Prosodic phonology and phonetics. Phonology Yearbook 1: 87-92. Pike, K. 1943. Phonetics: A Critical Analysis of Phonetic Theory and a Technic for the Practical Description of Sounds. Ann Arbor: University of Michigan Press. Price, P. J. 1980. Sonority and syllabicity: acoustic correlates of perception. Phonetica 37: 327-343. Prince, A. 1984. Phonology with tiers. In Aronoff and Oehrle (eds.) 234-44. Pulgram, E. 1970. Syllable, Word, Nexus, Cursus. The Hague and Paris. Mouton. Reilly, W. T. 1986. Asymmetries in place and manner in the English syllable. Research in Phonetics and Computational Linguistics Report No. 5. Departments of Linguistics and Computer Science, Indiana University, Bloomington, 124—148. Rialland, A. 1986. Schwa et syllabes en Francais. In L. Wetzels and E. Sezer (eds.) Studies in Compensatory Lengthening. Dordrecht: Foris, 187-226. Roberts, A. H. 1965. A Statistical Linguistic Analysis of American English. The Hague: Mouton. 332
Saussure, F. de. 1916. Cours de linguistique générale. Lausanne and Paris: Payot. Schein, B. and D. Steriade. 1986. On geminates, Linguistic Inquiry 17: 691-744. Schwyzer, E. 1939. Griechische Grammatik I. Munich. Selkirk, E. 1982. The syllable. In van der Hulst and Smith (eds.), part 2, 337-383. 1984. On the major class features and syllable theory. In M. Aronoff and R. T. Oehrle (eds.), 107-113. Sievers, E. 1881. Grundzüge der Phonetik. Leipzig: Breitkopf and Härtel. Steriade, D. 1982. Greek prosodies and the nature of syllabification. Ph.D. dissertation, MIT, Cambridge, MA. Stevens, K. and S. J. Keyser. 1987. Primary features and their enhancement in consonants. MS. MIT, Cambridge, MA. Tucker, A. N. 1962. The syllable in Luganda: a prosodic approach. Journal of African Linguistics 1: 122-166. Vago, R. 1985. Degenerate CV tier units. MS. Queens College, CUNY. Van Coetsem, F. 1979. The features 'vocalic' and 'syllabic'. In I. Rauch and G. F. Carr (eds.), Linguistic Method: Essays in Honor of Herbert Penzl. The Hague: Mouton, 547-556. Varma, S. 1929. Critical Studies in the Phonetic Observations of Indian Grammarians. London: The Royal Asiatic Society. Whitney, W. D. 1865. The relation of vowel and consonant. Journal of the American Oriental Society, vol. 8. Reprinted in W. D. Whitney, Oriental and Linguistic Studies, Second Series, Charles Scribner's Sons, New York: 1874. Williamson, K. 1978. Consonant distribution in Ijo. In M. A. Jazayeri et al. (eds.) Archibald Hill Festschrift, vol. 3. The Hague: Mouton, 341-353. Zwicky, A. 1972. A note on a phonological hierarchy in English. In R. P. Stockwell and R. K. S. Macaulay (eds.) Linguistic Change and Generative Theory. Bloomington: Indiana University Press, 275-301.
18 Demisyllables as sets of features: comments on Clements's paper OSAMU FUJIMURA
Clements addresses himself to one of the oldest issues in linguistics - namely, how syllables are formed out of constituent phonemic segments - and in doing so, makes a noteworthy step forward toward a rigorously formulated theory of sound patterns. What I can contribute to this endeavor, I hope, is a somewhat different perspective on the issue. My suggestions may be too radical to be immediately useful, but if they manage to direct phonologists' attention to some emerging findings in experimental phonetics, particularly articulatory studies, that are relevant to the very basis of nonlinear phonological theories like Clements's, and if we can thereby acquire a better understanding of the basic assumptions underlying the current theoretical development both in abstract and concrete representations, it will justify my participation in this forum of discussion. It is crucial, from my point of view, to note that Clements's idea of a demisyllable's internal organization is different from the idea I have been developing over the years. For Clements, a well-formed string of (phonemic) segments constitutes a demisyllable; the demisyllable is the domain over which sonority constraints on segment organization work. My approach, in contrast, is to try to use demisyllables as minimal integral units in place of phonemes. From this point of view, I defined a demisyllable as a set of (unordered) feature specifications (Fujimura 1979: 474), even though originally my way of thinking was still more in terms of phonemic sequences. When I coined the term demisyllable (Fujimura 1976), my description was given in a meeting abstract as follows: A method is proposed to decompose English syllables into phonetically and phonotactically well-motivated units, so that the complete inventory for segmental concatenation will contain at most 1,000 entries and still reproduce natural allophonic variations. A syllable is decomposed into a syllable core and syllable affix(es). Syllable cores have a general form of (Ci)(Ci)V(Cf)(Cf), where V is a single vowel, and Ci and Cf are consonantal elements where Cf may be a glide (semivowel or vowel 334
elongation). The /sp/, /st/, and /sk/ are treated as single consonantal elements with a place specification, and the ordering of the elements within the core is strictly governed by the vowel affinity principle (Fujimura, IEEE Trans. ASSP-23, 82-87 [1975]. The syllable core is divided into initial and final halves - demisyllables each of which can contain only one specification of the place of articulation. The final consonantal elements (such as / s / in /taeks/) that follow a placespecified true consonant (/k/ in this case) are treated as syllable affixes. These affixes are all apical, and they observe voicing assimilation with respect to the true consonant in the core. (Incidentally, I used the term syllable affix anticipating prefixes in some languages, though in English affixes are all suffixes. Of course I described syllable affixes to be independent of morphological considerations, as exemplified by words such as lens and tax. Also, we mentioned that syllable affixes in English occurred only morpheme-finally. See Fujimura and Lovins 1978 [IULC version], footnote 41.) In my (1979) analysis the vowel affinity assigned inherently to individual demisyllabic features (or specific combinations thereof) makes the order specification of segments redundant, and phonetic implementation rules will automatically form the demisyllabic articulatory signals as a multidimensional set of time functions. More exactly, the temporal parameters specified in a language's phonetic implementation rule system (probably in its feature table) can be interpreted qualitatively as vowel affinity assignments. I used the term vowel affinity rather than sonority to indicate that this concept is not definable purely in terms of universal phonetics, and therefore is basically a phonological concept. Phonemic segments, to the extent that their effects are approximately contained within their time domains in the signals thus formed, are epiphenomena, however useful as a practical approximation they may be. Indeed, my point when I used demisyllables for speech synthesis was that for signal manipulation as opposed to, say, transcription of speech, the most useful concatenative (i.e. segmental) units are demisyllables rather than phonemes. The same point can be made for automatic speech recognition (Rosenberg et al. 1983). But my underlying interest was to optimize the abstract representation of speech, even at the cost of reformulating phonological theory. I did not discuss lexical representations explicitly, because I did not know how I could formulate the framework. I was also keenly aware that I had not examined languages other than English in appropriate depth. The crucial question there is whether my notion of total redundancy of phoneme sequence specification holds for the full variety of languages. In particular, are there cases where, within a demisyllable, a pair of feature specifications constitute a minimal contrast in terms of temporal sequence, so that their timing characteristics assigned in implementation rules must be controlled by lexical specifications ? For example, in the case of forms like Russian 'tkut' and ' kto' - Clements's 335
example (5) - we could treat one of the initial stop sequences, say /kt/, as a sequence of syllable affix /k/ and the initial demisyllable {to},1 while accounting for /tk/ by implementation rules specifying apical events to take place more peripherally than other articulatory events. Or, if we cannot find any independent motivation for such a treatment, we may need to resort to some representational scheme that specifies the exceptional status of the element in the lexicon. In the case of English clusters, the abundance of examples such as task vs. tax but apt vs. *atp motivates treating /sp, st, sk/ as special single units,2 and phonetic characteristics of the plosives involved as well as other evidence (such as rhyming patterns in some poetic styles, Kiparsky, personal communication) support this special status of the clusters. (See Fujimura and Lovins 1978 for some discussion of the phonetics.) Based on this analysis, we claim that the place specification associated with a true consonant is given only once in English demisyllables, simplifying the phonetic implementation rule system greatly. While the framework for such a system apparently allows for language-particular options, the status of syllable affixes as more or less independent concatenative units - i.e., their capability of exchanging temporal positions, their stable phonetic patterns in different contexts, etc. - is expected to be universal. Also, it is not surprising to see a high statistical correlation between the particular forms of syllable affixes and occurrences of morphemic affixes in the given language. From this point of view, we need to ask about each apparent violation of the sonority cycle principle, not what alphabetical sequence appears in its phonetic transcription, but what its exact phonetic characteristics are. The goal is to understand how a syllable can really be defined, how it can be determined whether some apparent candidate for a local sonority peak should be identified as a separate syllable (or an affix). Particularly from an articulatory point of view, the issue is what the temporal specifications are for inherent segmental gestures (i.e. gestures inherent to demisyllabic features). We have observed in our x-ray microbeam data that the synchronization of events among different articulators is rather loose in general (see Fujimura 1981, 1986, 1987; Browman and Goldstein, this volume). In order to interpret such data, I assume a system of phonetic implementation rules that takes a phonological representation of a given linguistic form as its input, and computes the time course of articulation as its output. This computation takes the form of giving parametric values such as the timing of each critical movement (Fujimura 1986), or target values for constrictive positions of obstruents (by specifying the physiological control variable values for a pattern of contractions of a particular set of muscles, see Fujimura and Kakita 1979). Before we decide precisely what the temporal values involved in individual rules of phonetic implementation are, we must examine qualitative characteristics such as the ordering of events for different classes of features. The sonority cycle, from this perspective, would provide the basis for determining, first for each articulator, how crucial articulatory events 336
must be sequenced within a demisyllable, and then, among different articulators, how events will be linked to each other. The temporal organization of acoustic signal characteristics should emerge as the result of such computations. These acoustic, or perceptual, characteristics, expressed as a general principle like the sonority cycle, could well work as constraints for designing implementation rule systems. I would therefore be interested in finding out how exceptional with respect to each articulator "exceptional" cases are. There are three questions to be asked. First, what features using the same articulator can be specified together (as an unordered set) for the same demisyllable? Second, how are their corresponding gestures temporally arranged? Third, do sequence constraints thus discovered follow the same principle, in terms of some manner hierarchy, across different articulators and across different languages, as well as across different levels of phonological description ? As an answer to these questions, we might propose, for example, that the temporal sequencing of a set of articulatory events involving the tongue tip/blade, such as gliding, rhoticization, incomplete stop formation for a lateral, frication, and stop closure (including nasal stops), is fixed in this order from nucleus to margin. (We assume a mirror image between initial and final demisyllable, and the use of the same feature identities for both types of demisyllables reflects this assumption.) We might propose further that the same temporal sequencing principle will apply to other articulatory dimensions, and that it may be found to be generally applicable to different languages in the sense that phonetic implementation rules universally conform to reflect such a general constraint. In following this approach, we obviously need to use features for characterizing vowel gestures different from those for consonantal gestures, and we need to specify glides by separate features. Therefore, English /kwajt/ quite in its initial demisyllable has a (consonantal) place specification for dorsal (tongue body) and a (glide) feature specification of labio-velarization.3 Most cited exceptions do not demonstrate any violations of this principle within the same articulatory dimension. Apparent exceptions such as English /st/ and /ts/ in final clusters, can be accounted for by reconsidering the abstract identities of the elements in question, as we have seen above. But there are other more interesting counterexamples that pose a serious question as to the basic definition of the syllable. Before we address this question, however, let me point out one fairly general problem about alphabetic transcription, particularly transcriptions involving the peripheral positions of words. Suppose we have in some language a lexical contrast such as /kwa/ vs. /ka/. Given that a glide gesture tends to "spill over" the bounds of segmental integrity with respect to other articulatory (and voicing) features, particularly at the beginning and the end of an utterance in isolation (or at strong enough phrase boundaries), the /w/-gestures, labial or velar, may well be observed to occur quite noticeably before the implosion into the [k] in the case of /kwa/. (This is the case, 337
for example, in English.) Assuming that this hypothetical language has no contrast between /kwa/ and /wka/ within the same syllable, it is not a critical question whether it is transcribed as /kwa/ or /wka/. In terms of the peak of the labial gesture in relation to the [k]-closure, it may well be the case that the latter is more accurate as a phonetic transcription of the isolated utterance, except that in most languages, extraneous and discontinuous voicing at the beginning of a syllable is impossible in an isolated utterance. On the other hand, if there is voicing for initial [w] before voiceless [k], resulting in a voicing contour —I 1— (where the minuses at the edges indicate voiceless pauses surrounding the word in isolation), then probably nobody would argue that this hypothetical [wka] is a monosyllabic form. What about the real cases cited as exceptional in Clements's list (5), particularly those in (5b)? The examples from Ewe, like /wlu/ are not as bad as the hypothetical [wka] above, in that the voicing contour remains normal. Assuming that there is no contrast with /lwu/, the phonological representation may well be an initial demisyllable specifying a labialization/velarization together with a lateral as its constituent features. If the word-initial semivowel [w] is actually articulated together with the lateral gesture but is widely spread in time (strong labialization may be a characteristic of this language, even without the context of the following rounded vowel), the [w] segment may well exist phonetically in an isolated utterance. Without knowing the language, I hesitate to speculate further to account for /yra/. There are several other languages in the list cited by Clements, which exhibit apparently similar situations. Russian syllables such as /mgla/, in contrast, would be very difficult to account for by the implementation rule scheme, since the demisyllabic feature specifications inherently do not contain order information. There is no violation of the sonority cycle principle for each articulator, in this case. But because the temporal location of nasality must be confined to the [m] segment, presumably, and the labial specification must be associated with this, such Russian syllables are a typical situation where a classical phonemic analysis would seem effective. In the core-affix approach, affixes behave as mini-syllables, so to speak, and from this point of view this device would be quite appropriate for handling such phonemic segments that are not included in a demisyllable, assuming that there is strong enough phonological reason for saying that the segment in question does not represent a separate syllable. Note that this example does not violate the voicing contour principle, which I take seriously as a necessary condition for a syllable (including syllable affixes). The most serious problem, from my point of view, is the Russian word /rta/ (and a few similar cases). Assuming that this is typically pronounced with a voiced apical trill followed by a voiceless [t], it is just like the hypothetical example of /wka/ discussed above. In fact, it is worse since both / r / and / t / use the tongue tip. Phonetically, I think, /rta/ must be disyllabic. If there is a strong reason to 338
say that at the lexical level, or more specifically in core phonology, /rta/ cannot be disyllabic, then I think we will have to have a scheme for the lexical representation that specifies the necessary (abstract) r-feature as an extrademisyllabic fragment. Syllabification would create phonetic syllables out of the abstract specifications of demisyllables and other fragments (which may or may not be syllable affixes). The sonority cycle is observed within demisyllables in the lexical representation, by definition, and it is expected to hold at all phonological and phonetic levels. The result of the phonetic implementation in our example here would be a disyllabic form (with a syllabic [r]). I think a syllable should be defined as a minimal unit that is utterable in isolation at the phonetic level, and any use of the term should be in some way consistently related to this phonetic notion.
Notes
1 I denote demisyllabic forms in curly braces.
2 Despite its phonetic identity with independent /s/ in opposition to /f/, as in six vs. fix, I assumed that the /s/ in spy was just like the /r/ in pry, or the nasality in me vs. bee. One could say there is a manner paradigm /sp, p, b/ in English, just like the paradigm /p, P (forced), b (nonaspirated, lax)/ in Korean. That is, the s-feature is specified concomitant with but always implemented prior to the oral closure. This same s-feature, which I called "spirant" in Fujimura (1979), can be combined with a +nasal specification (and a place specification for the lip) as in smell, as well as with the -nasal (and +stop) specification as in spell. It can also be combined with -stop (or +continuant) as in sphere (in a certain subclass of the English lexicon). When specified as present, the spirant feature suppresses the tense-lax distinction altogether (note that the voice onset time for spy is similar to phrase-initial by rather than pie). In final demisyllables it does not combine with +nasal.
3 Incidentally, note that labialization for vowels and glides is phonetically rounding the protrusion of lips, whereas labial obstruent constrictions are commonly formed without such phonetic characteristics, even though there is often a mutual exclusion of the two types of gestures for the same articulator.
References
Fujimura, O. 1975. The syllable as a unit of speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 23: 82-87.
1976. Syllables as concatenated demisyllables and affixes. Journal of the Acoustical Society of America 59: Suppl. 1, S55.
1979. An analysis of English syllables as cores and affixes. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung 32: 471-476.
1981. Temporal organization of articulatory movements as a multidimensional phrasal structure. Phonetica 38: 66-83. [Corrected version in A. S. House (ed.) Proceedings of the Symposium on Acoustic Phonetics and Speech Modeling, Part 2, Paper 5. Institute of Defense Analysis, Princeton, New Jersey.]
1986. Relative invariance of articulatory movements: an iceberg model. In J. S. Perkell and D. H. Klatt (eds.) Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence Erlbaum Associates, 226-234.
1987. Fundamentals and applications in speech production research. In Proceedings of the 11th International Congress of Phonetic Sciences, Tallinn, Vol. 6, 10-27.
Fujimura, O., and Y. Kakita. 1979. Remarks on quantitative description of the lingual articulation. In S. Ohman and B. Lindblom (eds.) Frontiers of Speech Communication Research. London: Academic Press, 17-24.
Fujimura, O., and J. B. Lovins. 1978. Syllables as concatenative phonetic units. In A. Bell and J. B. Hooper (eds.) Syllables and Segments. Amsterdam: North-Holland, 107-120. [Full version distributed as a monograph by the Indiana University Linguistics Club, 1982.]
Rosenberg, E. A., L. R. Rabiner, G. J. Wilpon, and D. Kahn. 1983. Demisyllable-based isolated word recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing 31: 713-726.
19 Tiers in articulatory phonology, with some implications for casual speech CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN
19.1 Introduction
We have recently begun a research program with the goal of providing explicit, formal representations of articulatory organization appropriate for use as phonological representations (Browman and Goldstein 1986; Goldstein and Browman 1986). The basic assumption underlying this research program is that much phonological organization arises from constraints imposed by physical systems. This is of course a common assumption with respect to the elements - features - used in phonological description; it is not such a common assumption, at least in recent years, with respect to the organization of phonological structures. In our view, phonological structure is an interaction of acoustic, articulatory, and other (e.g. psychological and/or purely linguistic) organizations. We are focusing on articulatory organization because we believe that the inherently multidimensional nature of articulation can explain a number of phonological phenomena, particularly those that involve overlapping articulatory gestures. Thus, we represent linguistic structures in terms of coordinated articulatory movements, called gestures, that are themselves organized into a gestural score that resembles an autosegmental representation. In order to provide an explicit and testable formulation of these structures, we are developing a computational model in conjunction with our colleagues Elliot Saltzman and Phil Rubin at Haskins Laboratories (Browman, Goldstein, Kelso, Rubin and Saltzman 1984; Browman, Goldstein, Saltzman, and Smith 1986). Figure 19.1 displays a schematic outline of this model, which generates speech from symbolic input. As can be seen from the number of submodels in the figure, gestures are relatively abstract. Even articulatory trajectories are one step more abstract than the output speech signal - they serve as input to the vocal tract model (Rubin, Baer, and Mermelstein 1981), which generates an acoustic signal. In addition, the actual articulatory trajectories associated with a gesture are generated from a dynamical description, which introduces another layer of abstraction. The
Figure 19.1 Computational modeling of gestures using articulatory dynamics (intended utterance → linguistic gestural model → task dynamic model → vocal tract model → output speech)
particular dynamic model we are using, the task dynamics of Saltzman and Kelso (1987), requires that gestures be discrete, a further abstraction. That is, we assume that continuous movement trajectories can be analyzed into a set of discrete, concurrently active underlying gestures. And finally, the discrete, abstract, dynamically defined gestures are further organized into gestural scores in the linguistic gestural model. It is the qualities of discreteness and abstractness, when combined with the inherently spatiotemporal nature of the dynamically defined gestures, that give this system its power. As abstract, discrete, dynamic linguistic units, the gestures are invariant across different contexts. Yet, because the gestures are also inherently spatio-temporal, it is possible for them to overlap in time. Such overlapping activation of several invariant gestures results in context-varying articulatory trajectories, when the gestures involve the same articulators, and in varying acoustic effects even when different articulators are involved. That is, much coarticulation and allophonic variation occurs as an automatic consequence of overlapping invariant underlying gestures (see Fowler 1980; Liberman, Cooper, Shankweiler, and Studdert-Kennedy 1967). And these qualities of the system also provide a relatively straightforward and invariant description of certain casual speech phenomena, as we shall see in section 19.3. While the system is powerful, it is also highly constrained: there is a fair amount of structure inherent in the gestural framework. One important source of structure resides in the anatomy of the vocal tract, which provides a highly constrained,
three-dimensional articulatory geometry; this structure will be outlined in section 19.1.1. A second important source of structure resides in the dynamical description, outlined in section 19.1.2. Both of these types of structure come together in the task dynamics model (Saltzman 1986; Saltzman and Kelso 1987), in which the basic assumptions are (1) that one primary task in speaking is to control the coordinated movement of sets of articulators (rather than the individual movements of individual articulators), and (2) that these coordinated movements can be characterized using dynamical equations.
19.1.1 Articulatory organization

Vocal tract variables. In order for the task dynamics model to control the movement of a set of articulators, those articulators needed to accomplish the desired speech task or goal must first be specified. For example, a lip closure gesture involves the jaw, the lower lip, and the upper lip. These articulators are harnessed in a functionally specific manner to accomplish the labial closure task. It is the movement characteristics of the task variables (called vocal tract variables) that are controlled in task dynamics. Thus, the lip closing gesture refers to a single goal for the tract variable of lip aperture, rather than to a set of individual goals for the jaw, lower lip, and upper lip. The current set of tract variables and their associated articulators can be seen in figure 19.2.

Gestures. In working with the tract variables, we group them into gestures. The oral tract variables are grouped in terms of horizontal-vertical pairs, where both members of a pair refer to the same set of articulators: LP-LA, TTCL-TTCD, TBCL-TBCD. ("Horizontal" and "vertical" refer to these dimensions in a straightened vocal tract, i.e. a tube model; thus, constriction degree is always considered to be orthogonal to, and hence "vertical" with respect to, the "horizontal" dimension of the stationary upper or back wall of the oral tract.) The oral gestures involving the lip, tongue tip, and tongue body thus consist of paired tract variables, where each tract variable associated with a gesture is modeled using a separate dynamical equation. That is, for oral gestures, two dynamical equations are used, one for constriction location and one for constriction degree. Since the glottal and velic aperture tract variables do not occur in pairs, they map directly onto glottal and velic gestures, respectively. We refer to the gestures symbolically, using the symbols displayed in table 19.1 for the gestures described in this paper.

Gestural scores. In order to apply task dynamics to speech in a linguistically interesting way, we must be able to model the articulatory structure of an entire utterance in terms of a set of gestures. This larger structure we refer to as a gestural score. Figure 19.3a shows a symbolic representation of a hypothetical gestural score (for the word "palm," pronounced [pʰam]). The articulatory trajectories associated with the gestures can be visualized with the aid of figure 19.3b, which shows the trajectories of four tract variables: velic aperture, tongue
Figure 19.2 Tract variables (tract variable: articulators involved)
LP lip protrusion: upper & lower lips, jaw
LA lip aperture: upper & lower lips, jaw
TTCL tongue tip constrict location: tongue tip, body, jaw
TTCD tongue tip constrict degree: tongue tip, body, jaw
TBCL tongue body constrict location: tongue body, jaw
TBCD tongue body constrict degree: tongue body, jaw
VEL velic aperture: velum
GLO glottal aperture: glottis

Table 19.1 Gestural symbols (symbol: referent; tract variables)
i: palatal gesture (narrow); TBCD, TBCL
a: pharyngeal gesture (narrow); TBCD, TBCL
β: bilabial closing gesture; LA, LP
τ: alveolar closing gesture; TTCD, TTCL
σ: alveolar near-closing gesture (permits frication); TTCD, TTCL
λ: alveolar lateral closing gesture; TTCD, TTCL
κ: velar closing gesture; TBCD, TBCL
body constriction degree, lip aperture, and glottal aperture. Each curve shows the changing size of a constriction over time, with a larger opening being represented by a higher value, and a smaller opening (or zero opening, such as for closure) being represented by lower values. Note that in some cases (e.g. velic aperture, lip aperture) this orientation produces a picture that is inverted with respect to the
Figure 19.3 Hypothetical gestural representation for "palm." (a) Symbolic gestural score, (b) hypothetical trajectories (closure is indicated by lowering).
vertical movement of the major articulator (velum, lower lip) involved in changing the constriction size. For tongue body constriction degree, the constriction is in the pharynx, so that the relevant articulatory movement (of the rear and root of the tongue) is primarily horizontal. The trajectories of the tongue and lip tract variables are approximated from measured articulatory data (for the utterance "pop," rather than "palm"), while the trajectories for the velic and glottal variables are hand-drawn estimates, included for illustrative purposes. The shaded areas show the portions of these trajectories that would be generated by the dynamical systems representing the constriction degree tract variables for each of the six labeled gestures.

Articulatory tiers. The gestures in the gestural score are organized into articulatory tiers, where the tiers are defined using the notion of articulatory independence. Velic gestures are obviously the most independent, since they share no articulators with other gestures. In this limiting case, velic gestures constitute a wholly separate nasal (velic) subsystem and hence are represented on a separate velic tier. Glottal gestures also participate in an independent subsystem (although other laryngeal gestures, for example for larynx height, would also participate in this subsystem), and hence are represented on a separate glottal tier. The oral gestures form a third subsystem, with the jaw as a common articulator. Since the oral gestures are distinguished by different combinations of articulators, oral gestures are represented on three distinct oral tiers, one for the lips, one for the tongue body, and one for the tongue tip. Each of these is associated with a distinct pair of tract variables (see above). Note that within the oral subsystem, the tongue body and tongue tip tiers effectively form a further subclass, since they share two articulators, the jaw and the tongue body proper. Note that these articulatory tiers correspond closely to organizations posited by both phoneticians and autosegmental phonologists. The three oral tiers of lips, tongue tip, and tongue body correspond to the traditional groupings of places of articulation into three major sets: labial, lingual, and dorsal (Vennemann and Ladefoged 1973; Halle 1982; Ladefoged and Maddieson 1986). And autosegmental phonologists have often proposed tiers that correspond to independent articulatory systems, e.g. the larynx (for tone and voicing), the velum (for nasality), and the oral articulators (Clements 1980, 1985; Goldsmith 1976; Thrainsson 1978).

19.1.2 Dynamical description
In addition to specifying the articulatory structure of an utterance by selecting the appropriate gestures, the gestural score also specifies the values of the dynamic parameters for use in the task-dynamic model. The task dynamic model uses these values as the coefficients of damped mass-spring equations (see appendix A), thereby generating characteristic movement patterns for the tract variables as well as coordinating the component articulators to achieve these movement patterns.
Figure 19.4 Abstract underlying gesture. (a) One cycle, (b) equilibrium position, (c) critical damping, (d) phasing between two gestures.
Since the current version of the model assumes unit mass and critical damping, the stiffness k and the equilibrium position x0 are the parameters in the equation that can vary in order to convey linguistic information such as phonetic identity or stress (see appendix B). Figure 19.4 shows how these parameters are used to characterize an abstract underlying gesture. We begin by assuming that a gesture consists of an abstract underlying 360 degree cycle, represented in figure 19.4a by the hump (in impressionistic terms), which is also a single cycle of an undamped cosine (in more quantitative terms). Figure 19.4b shows the equilibrium position for an arbitrary tract variable associated with this abstract gesture. For the undamped cosine in figure 19.4b, the trajectory generated by the abstract gesture oscillates around the equilibrium position, which 347
is midway between the peaks and valleys. The amount of time it takes for the gesture to complete this cycle is a reflection of its stiffness (given that we assume unit mass). The stiffer the gesture, the higher its frequency of oscillation and therefore the less time it takes for one cycle. Note that this also means that, for a given equilibrium position, the stiffer the gesture, the faster the movement of the associated articulators will be. However, the trajectory actually generated by our system is qualitatively different from the undamped "hump" seen in figure 19.4b since we assume critical damping rather than zero damping. As can be seen in figure 19.4c, the trajectory generated by a critically damped gesture approaches the equilibrium position increasingly slowly, rather than oscillating around it. In fact, in a critically damped system the peak displacement approaches the equilibrium position, which thus serves as the "target" (in effect, the target is the asymptote). Because it takes an infinite amount of time to actually reach the target in a critically damped system, we have specified that the effective achievement of the target is at 240 degrees with respect to the abstract underlying 360 degree cycle. This means that effectively only half the underlying abstract cycle is generated by a single gesture: the "to" portion of the underlying cycle. (This partial generation is exemplified in figure 19.3b; we are currently experimenting with generating the "fro" portion as well.) Task dynamics serves to coordinate the articulators within a particular gesture; it also coordinates the effects, on a single articulator, of several different concurrent gestures. It does not, however, yet provide a mechanism for coordinating the gestures themselves. That must be explicitly specified in the gestural score. The abstract specification we adopt involves a relative phase description (Kelso and Tuller 1985). In such a description, gestures are synchronized with respect to one another's dynamic states, rather than timed by an external clock. In the current version of our system, gestures are phased with respect to each other's abstract underlying 360 degree cycles, as can be seen in figure 19.4d. In the figure, the two gestures are phased such that the point corresponding to 240 degrees for the top gesture is synchronized with the point corresponding to 180 degrees of the bottom gesture. Given this gesture-to-gesture approach to phasing, a complete characterization of the gestural score for a particular utterance must specify which gestures are phased with respect to each other. In the next section, we explore this question further.
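For concreteness, the kind of equation involved (presumably the form spelled out in appendix A, though the notation here is ours) is the standard critically damped mass-spring system:

m x''(t) + b x'(t) + k (x(t) - x0) = 0,  with m = 1 (unit mass) and b = 2 sqrt(k) (critical damping),

where x is a tract variable, x0 its equilibrium position (the target), and k its stiffness. The undamped natural frequency is sqrt(k/m), so the duration of the abstract 360-degree cycle is 2π sqrt(m/k): doubling the stiffness shortens the cycle by a factor of sqrt(2), which is the sense in which a stiffer gesture is a faster gesture.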
19.2 Gestural scores and tier redisplays
In the sections that follow, we will exemplify our explorations of the organization of gestures into gestural scores using X-ray data. These data come from the AT&T Bell Laboratories archive of X-ray microbeam data (Fujimura, Kiritani, and
Figure 19.5 "piece plots" ([pis#'plats]). (a) Symbolic gestural score (oral tiers only), (b) X-ray pellet trajectories (closure is indicated by raising).
Ishida 1973; Miller and Fujimura 1982), partially collected by researchers at Haskins Laboratories. In the data we examined, the X-ray microbeam system tracked the positions of (up to seven) small lead pellets placed on the lower lip, the jaw, the tongue blade, the tongue dorsum mid, the tongue dorsum rear, and/or the soft palate, in addition to two reference locations. We have examined a sample of utterances from three speakers of Standard American English, using the horizontal and vertical displacements of the pellets over time as the source data for deriving the dynamic parameter values and phasing for our gestural scores. We begin, in this section, by exploring gestural scores for canonical forms; in
section 19.3, we will look at how these canonical forms are modified in casual speech. In this section, then, we focus on data in which the syllable structure differs among the utterances. So far, we have examined paired utterances of the form [...iCa...], [...iCCa...], and [...iCCCa...], where the second syllable is always stressed, and the pairs are distinguished by a word boundary occurring before the first consonant in one member of the pair, and after the first consonant in the other member of the pair. The single consonants were [s], [p], and [l]; the triplet was [spl]; the doublets were [sp], [sl], and [pl]. Thus, the paired utterances differ in terms of how the initial consonant is syllabified, e.g. [...i#'spa...] vs. [...is#'pa...]. Figure 19.5a displays a symbolic gestural score (oral tiers only) for one of these utterances, [pis#'plats]; we will be working with this score and variants thereof throughout this section. Figure 19.5b shows the articulatory trajectories from the X-ray data that this score is intended to represent, with the gestural symbols added at approximately the target. For our preliminary investigations, we have assumed that the horizontal displacement of the rear of the tongue corresponds most closely to the trajectory generated by the vocalic gestures we are using ({i} and {a}), while the lip and blade vertical displacements correspond to the trajectories generated by the consonantal gestures we are considering ({β}, {σ}, and {λ}). (We are using the curly braces { and } to denote gestures.) The measured pellet trajectories can only be an approximation to the trajectories that would be generated by individual gestures, however, partly because of uncertainty as to pellet placement (especially on the tongue), but mostly because of the overlap of gestures, particularly when they are on the same articulatory tier. In the case of gestural overlap, several gestures will contribute to the observed articulatory trajectories. That is, gestures are discrete, abstract entities that combine to describe/generate the observed continuous articulatory trajectories.
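As a small illustration of what the symbolic score in figure 19.5a amounts to as a data object, the oral tiers might be written down as follows (a sketch of our own; the spelled-out symbol names and the Python representation are not the authors'):

oral_score = {
    "tongue body": ["i", "a"],                          # vocalic gestures
    "tongue tip": ["sigma", "lambda", "tau", "sigma"],  # [s], [l], [t], final [s]
    "lips": ["beta", "beta"],                           # the two [p] closures
}
for tier, gestures in oral_score.items():
    print(tier, gestures)

The order within each tier list simply follows the order in which the constriction-degree targets are reached, which is the sequencing that the oral projection tier (section 19.2.1) makes explicit.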
19.2.1 Rhythmic and functional tiers
Gestural scores are subject to the same display problem as any multitiered approach, namely, associations among tier nodes can only be easily specified for immediately contiguous tiers in a display. Moreover, the gestural score as described so far only contains articulatory information. Stress information, for example, is not included. Therefore, in this subsection we introduce two additional types of display, one that incorporates stress information, and one that facilitates the display of associations and phasing among the gestures. (Our use of "display" is based on that of Clements and Keyser 1983.) Before moving further into discussions of formalism, however, a word of caution is in order. We do not intend our use of formalism and symbols to be introducing new or "deeper" realities into a gestural description. That is, for us symbols do not generate the gestures; rather, symbols are pointers to gestures, for
descriptive convenience. Similarly, the various displays serve merely to emphasize one or another aspect of the gestural organization; displays are always projections that serve to decrease the dimensionality of the underlying reality, again for descriptive convenience. Rhythmic tier. To incorporate stress information, we use an additional rhythmic tier. (We assume pitch accent is a separate but related phenomenon, following Selkirk [1984]; we do not attempt to account for pitch accent at present.) Nodes on the rhythmic tier consist of stress levels; each syllable-sized constellation of gestures will be associated with a stress node. Each stress node affects the stiffness and constriction degree of the gestures associated with it (see appendix B). Note that we do not call this a syllable tier. Our current hypothesis is that the rhythmic component is a separate and independent component, whereas syllabicity is a complex function of gestural organization. Since we are only barely beginning to get a handle on gestural organization, we prefer to be conservative in our postulated structures. That is, we continue to err on the side of under-structured representations. (In addition, it is possible that the rhythmic component may be associated with its own set of articulators. We are intrigued, for example, by the notion that the jaw may be heavily implicated in the rhythmic component, as suggested by Macchi [1985] among others.) The first redisplay, then, seen in figure 19.6a, simply adds the rhythmic tier to the gestural score. All the gestures occurring under the curly bracket are assumed to be affected by the stress value of the relevant node on the rhythmic tier. That is, the curly bracket is a shorthand notation indicating that the value on the rhythmic tier is associated with every gesture on the articulatory tiers displayed beneath it. Oral projection tier. An alternative display of the associations between the rhythmic tier and oral tiers is seen in figure 19.6b, where the dimensionality is reduced by projecting the gestures from the lip, tongue body, and tongue tip onto a single oral tier. The sequence on the oral tier indicates the sequence of achievement of the constriction degree targets for the gestures. That is, since the gestures are inherently overlapping, a single point must be chosen to represent them in their projection onto a single tier. The achievement of their target values represents the sequencing that occurs in canonical form; it also captures the sequencing information that most directly maps onto acoustically defined phonetic segments. Functional tiers. Another type of display serves to reorganize the gestures, using functional tiers. At the current stage of our explorations, we are positing two functional tiers, a vocalic one and a consonantal one. This distinction is functionally similar to that made in CV phonology (Clements and Keyser 1983; McCarthy 1981), especially in its formulation by Keating (1985). Like Keating, we are struck by the usefulness of separate C and V tiers for describing certain aspects of articulation, although we differ in our definition of the nodes on the tiers, which for us are dynamically-defined articulatory gestures. What C and V tiers can
Figure 19.6 Tier displays for "piece plots" ([pis#'plats]). (a) Symbolic gestural score (oral tiers only) with rhythmic tier added, (b) Oral projection with rhythmic tier.
crucially capture is the fact of articulatory overlap between the vowels and consonants. The X-ray data we have analyzed thus far (see, for example, figure 19.5b) have consistently supported the contention (Ohman 1966; Fowler 1983) that consonant articulations are superimposed on continuous vowel articulations, which themselves minimally overlap. As can be seen in figure 19.7a, this description is directly captured using functional C and V tiers. The association lines between the two tiers indicate that the associated consonantal gestures all co-occur with the vowel gesture. The adjacency of the gestures on the V tier indicates that the vowel articulations are effectively continuous, with the possibility as well of minimal overlap. We will discuss the C tier below. Here we simply note that the last consonantal gesture is not associated with a vocalic gesture. While we have not yet investigated syllable-final clusters in any detail, a cursory inspection suggests that the vocalic gesture in fact does not co-occur with this final consonantal gesture. It is intriguing to speculate how this might relate to extra-metrical consonants (cf. Hayes 1981) and/or phonetic affixes (Fujimura and Lovins 1978). C and V tiers have several advantages, beyond the clear advantage of direct representation of articulatory overlap. As has been noted elsewhere (e.g. Fowler 1981; Lindblom 1963; Ohman 1967), the ability to deal with overlapping articulatory specifications makes it possible to unify the account of temporal and segmental variability. For example, increasing the amount of overlap between the 352
Figure 19.7 Consonant and vowel tier displays for "piece plots" ([pis#'plats]). (a) Associations (overlap), (b) phasing.
articulatory movements associated with a vowel and a following consonant will simultaneously shorten the acoustic signal associated with the vowel and cause increasing amounts of coarticulation to be observed in the acoustic spectrogram and in the movements of individual articulators. In addition, positing separate C and V tiers as universal articulatory organizations leads to a new perspective on the role of both nonconcatenative (CV-based) morphology and vowel harmony. In the former case, McCarthy's (1981) analysis of Semitic morphology using C and V tiers can be seen simply as a morphologization of an already existing, universal articulatory organization. (A similar point was made by Fowler 1983.) In the latter case, vowel harmony simply becomes a natural extension of the already existing V tier. Our C and V tier display differs from related displays in other phonologies in the interpretation of sequencing, which acts like a combination of tier and (linear feature) matrix displays. That is, like autosegmental tier displays, associations among nodes on different tiers are indicated by association lines rather than by sequencing, thereby permitting many-to-one associations. Like both autosegmental and matrix displays, the sequence on each tier is meaningful. Unlike autosegmental displays, however, and like matrix displays, sequencing between tiers is also meaningful. In this sense, the C and V tier display is a two-dimensional plane, with sequencing proceeding along the horizontal dimension, and function type along the vertical dimension. The horizontal sequencing must capture exactly the same sequence of gestures as that displayed by the oral tier discussed above; that is, a constraint on all displays is that they must portray the canonical sequencing relations when projected onto a single tier. The sequencing between gestures on the V tier and the C tier is visually conveyed in figure 19.7 by the angle of the lines: a line slanting right (i.e. with its top to the right of its bottom) 353
indicates that the consonant(s) precede the associated vowel, while a line slanting left indicates that the consonant(s) follow the associated vowel.

Contiguity operates differently on the C and V tiers. This captures a functional difference between vowels and consonants, where vowels act as a kind of background to the "figure" of the consonants. On the one hand, gestural contiguity on the V tier is completely independent of the C tier. This reflects the fact, to be discussed in detail in the next section, that vowel articulations are contiguous (or partially overlapping), regardless of the number of consonants intervening. On the other hand, contiguity on the C tier is sensitive to intervening vocalic gestures, in the sense that consonantal gestures overlap considerably less, if at all, when a vocalic gesture intervenes. A related effect of this functional difference has to do with the change in stiffness as more consonants are inserted between vowels: the vowel gestures decrease in stiffness while the consonantal gestures increase their stiffness.

19.2.2 Using functional tiers with phasing
In this subsection, we explore some details concerning the specification of phase relations among gestures on the C and V tiers. That is, given that gestures are spatio-temporal in nature, we need to be able to specify how they are coordinated - we cannot simply assume they are coordinated onset-to-onset, or onset-to-target. Figure 19.7b shows a schematic representation of the phase associations for our sample symbolic gestural score. Here, the only association lines from figure 19.7a that remain are those that phase gestures relative to each other. Statement (1) summarizes the phasing associations that hold between the V and C tiers: (1)
A vocalic gesture and the leftmost consonantal gesture of an associated consonant sequence are phased with respect to each other. An associated consonant sequence is defined as a sequence of gestures on the C tier, all of which are associated with the same vocalic gesture, and all of which are contiguous when projected onto the one-dimensional oral tier.
Notice that this phasing association statement follows the unmarked pattern of associations for tiers in autosegmental phonologies (left precedence). Statement (2a) specifies the numerical values for phase relations between a vowel and the following leftmost consonant. An example of this phasing can be seen in the X-ray pellet trajectory data depicted in figure 19.8 for [pip#'ap], for the vocalic gesture {i} (tongue rear) and the following consonantal gesture {β} (lower lip). (2a)
A vocalic gesture and the leftmost consonantal gesture of a following associated sequence are phased so that the target of the consonantal gesture (240 degrees) coincides with a point after the target of the vowel (about 330 degrees). This is abbreviated as follows: C(240) == V(330)
Figure 19.8 X-ray pellet trajectories for "peep op" ([pip#'ap]), showing phasing for vowel and leftmost following consonant ([ip]). The arrows indicate the points being phased.
Statement (2b) specifies the numerical values for phase relations between a vowel and the preceding leftmost consonant. Figure 19.9a exemplifies this statement in the utterance [pi#'plats] for the vocalic gesture {a} (tongue rear) and the preceding consonantal gesture {β} (lower lip). (2b)
A vocalic gesture and the leftmost consonantal gesture of a preceding associated consonant sequence are phased so that the target of the consonantal gesture (240 degrees) coincides with the onset of the vocalic gesture (0 degrees). This is abbreviated as follows: C(240) == V(0)
To complete our statements about phase relations for a single syllable-sized constellation, we need only specify how the remaining consonantal gestures are phased. Figure 19.9b exemplifies this statement, again using the utterance [pi#'plats], but this time for two consonantal gestures, the {λ} gesture (tongue blade) and the immediately preceding {β} gesture (lower lip). (3)
Each consonantal gesture in a consonant cluster is phased so that its onset (0 degrees) coincides with the offset of the immediately preceding consonant (about 290 deg.): Cn(0) == Cn-1(290) A consonant cluster is defined as a well-formed associated consonant sequence. A sequence is well-formed iff it conforms to the syllable structure constraints of the language.
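To make the bookkeeping in statements (2b) and (3) concrete, the following small sketch (ours, not the authors' implementation; the class, the millisecond units, and the toy durations are all assumptions) converts the phase specifications into absolute onset times, given that each gesture's abstract 360-degree cycle has a fixed duration determined by its stiffness.

from dataclasses import dataclass

@dataclass
class Gesture:
    label: str             # e.g. "beta" for a bilabial closing gesture
    period_ms: float       # duration of one abstract 360-degree cycle
    onset_ms: float = 0.0  # absolute time at which the cycle starts (0 degrees)

    def time_at(self, phase_deg: float) -> float:
        # Absolute time at which this gesture reaches a given phase.
        return self.onset_ms + (phase_deg / 360.0) * self.period_ms

def phase_to_vowel(c: Gesture, v: Gesture) -> None:
    # Statement (2b): C(240) == V(0); the consonant's target (240 degrees)
    # coincides with the onset (0 degrees) of the following vowel.
    c.onset_ms = v.time_at(0.0) - (240.0 / 360.0) * c.period_ms

def phase_in_cluster(prev_c: Gesture, next_c: Gesture) -> None:
    # Statement (3): Cn(0) == Cn-1(290); each consonant's onset coincides
    # with the offset (about 290 degrees) of the preceding consonant.
    next_c.onset_ms = prev_c.time_at(290.0)

# Toy onset of "plots": {beta} and {lambda} phased against the vocalic {a},
# using invented durations purely for illustration.
a = Gesture("a (pharyngeal)", period_ms=300.0, onset_ms=200.0)
p = Gesture("beta (bilabial)", period_ms=120.0)
l = Gesture("lambda (lateral)", period_ms=120.0)

phase_to_vowel(p, a)      # p's target lands at the vowel's onset
phase_in_cluster(p, l)    # l begins at p's offset
print(round(p.onset_ms, 1), round(l.onset_ms, 1))  # 120.0 216.7

Statement (4) below would then amount to re-running phase_to_vowel for the second vowel against the leftmost intervocalic consonant, which is how the two constellations of an utterance end up phased to each other.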
Figure 19.9 X-ray pellet trajectories for "pea plots" ([pi#'plats]). (a) Phasing for vowel and leftmost preceding consonant ([#'p...a]), (b) phasing for consonants ([pl]).
The above set of statements is sufficient to phase the vocalic and consonantal gestures in a single syllable-sized constellation of gestures; it does not, however, completely account for the phase relations required for the entire utterance [pis#'plats]. (Nor, of course, does it account for gestures on the other tiers, the glottal and velic. We will have nothing to say about the phasing of gestures on these tiers in this paper.) One more statement is needed, one that somehow
coordinates the two constellations we are dealing with. For the X-ray data we are exploring, this additional statement appears to associate the leftmost of a sequence of intervocalic consonants with both vowels (an analog of ambisyllabicity): (4)
The leftmost consonantal gesture of a consonant sequence intervening between two vocalic gestures is associated with both vocalic gestures. A consonant sequence is defined as intervening iff the entire sequence lies between the two vocalic gestures when projected onto the one-dimensional oral tier.
Once statement (4) has applied to associate a consonantal gesture with the vocalic gesture in the neighboring constellation, the structural descriptions of statements (2) and (3) are satisfied for this new C-V association so that they automatically apply. A symbolic display of this process is seen in figures 19.10a and 19.10b for the utterance [pis#'plats], showing how the two constellations in figure 19.7 are associated and phased with respect to each other. Figure 19.11 exemplifies the process using X-ray trajectories for the utterance [pis#'plats]. This reapplication of statement (2) thus phases the onset of a vocalic gesture with respect to the target of the leftmost consonant of a preceding associated consonant sequence, regardless of the canonical syllable affiliation of that consonant. In the reapplication, the potential transitivity of statements (2a) and (2b) is activated, so that the two vocalic gestures are effectively phased to each other. That is, the onset of the second vocalic gesture in figures 19.7b and 19.10b will coincide with 330 degrees of the first vocalic gesture. (This will also be the default in the case of two vocalic gestures with no intervening consonantal gestures.) The phasing statements (2)-(4) are a very sketchy beginning to the task of specifying the relations among gestures. How likely is it that they are a dependable beginning? We expect all of them, with the possible exception of (2a), to be confirmed, at least in essence, in future investigations. We are fairly confident about statements (2b) and (4) (although the precise numbers for the phases may need refining), since similar observations have been made by others. For example, Gay (1977, 1978) has shown that tongue movement toward the second vowel in a VCV sequence begins shortly after the onset of acoustic closure for the consonant. Similarly, Borden and Gay (1979) have shown such vowel movements beginning during the /s/ of /sC/ clusters. We also are struck by the convergence of these articulatory phasing statements with principles of syllable affiliation proposed by phonologists. For example, statement (2b) is the articulatory analog of the Principle of Maximal Syllable Onset (cf. Clements and Keyser 1983), while statement (4) is the articulatory analog (for oral gestures) of Resyllabification (Clements and Keyser 1983). We are not particularly confident about the related statement (2a), however, because there is a complex interaction between phasing and stiffness, at least for vowels, about which we still understand very little. This is true in particular when
Figure 19.10 Resyllabified consonant and vowel tier displays for "piece plots" ([pis#'plats]). (a) Associations (overlap), (b) phasing.
Figure 19.11 X-ray pellet trajectories for "piece plots" ([pis#'plats]), showing resyllabified phasing for vowel and leftmost preceding consonant ([s#'...a]).
two constellations are concatenated, as in our example. Here, rephasing the second vowel to an earlier preceding consonant (as in the rephasing between figures 19.7b and 19.10b) will either cause the target of that vowel to be reached earlier, or necessitate a modification of the stiffness of the vocalic gesture so that its target is reached at about the same time relative to the immediately preceding consonant. If the stiffness of the vowel is changed, however, it will change the temporal relations between the vowel and the following consonantal gestures, assuming their phasing remains the same. While the current version of our phasing rules in the computational model uses the second approach, that of modifying the stiffness of the vocalic gesture, much more work is needed in this area.
Statement (3), on the timing of onsets to preceding releases for consonants, has been reported elsewhere (e.g. Kent and Moll 1975), and so we are fairly confident about this statement as well. Kent and Moll's data also show that the position of syllable boundaries is irrelevant, at least for sequences that are possible well-formed syllable onsets. The exact formulation of the statement, however, is a matter of considerable interest to us. In particular, to what extent is the timing of a consonant dependent on the well-formedness of the consonant sequence (in terms of not violating syllable structure constraints)? In the next section, we explore some implications of this and other aspects of our proposed gestural structures for describing casual speech.
19.3 Generalizations about casual speech
The gestural representations that we have outlined above form the basis of a simplified account of the phonological/phonetic alternations that occur in continuous, fluent speech. In particular, a number of superficially unrelated alternations (unrelated in that their description requires separate phonological rules of the conventional sort) can be shown to follow from a generalization about gestural organizations and how they may be modified in the act of talking. The power of such generalizations follows from the incorporation of the spatiotemporal nature of speech in the representation, both in terms of the definition of individual gestures as events occurring in space and time, and in the explicit specification of the spatio-temporal (phase) relations among gestures. There have been a number of attempts by linguists to characterize the differences between the pronunciation of words in isolation and their realization in "casual" connected speech (e.g., Zwicky 1972; Shockey 1973; Oshika, Zue, Weeks, Neu, and Auerbach 1975; Kahn 1976; Brown 1977; Dalby 1984; Barry 1984). In this paper, we define "casual" speech as that subset of fast speech in which reductions typically occur. In casual speech, then, there are typically gross restructurings between the "ideal" phonetic representation of a word — its canonical form - and a narrow phonetic transcription of its form in context. Segments are routinely elided, inserted, and substituted for one another. The examples in (5) (taken from Brown 1977) show (a) consonant deletion, (b) consonant assimilation, and (c) simultaneous deletion and assimilation. (5)
(a) /'mʌst bi/ → ['mʌsbi] ("must be")
(b) /hʌndrəd 'paʊndz/ → [hʌndrəb 'paʊndz] ("hundred pounds")
(c) /'graʊnd 'preʃə/ → ['graʊm 'preʃə] ("ground pressure")
Thus, the narrow phonetic transcription of a word in context can be radically different from its systematic phonetic representation. While a number of the above authors have attempted to describe such changes with lists of phonological rules
that apply in casual, or fluent speech, these lists fail to uncover generalizations about casual speech that underlie these particular changes. Such generalizations do emerge, however, from a description of these changes in terms of the variation in their gestural scores. From the gestural point of view, the relationship between the lexical characterization of a word and its characterization in connected speech is much simpler and more highly constrained. We propose that most of the phonetic units (gestures) that characterize a word in careful pronunciation will turn out to be observable in connected speech, although they may be altered in magnitude and in their temporal relation to other gestures. In faster, casual speech, we expect gestures to show decreased magnitudes (in both space and time) and to show increasing temporal overlap. We hypothesize that the types of casual speech alternations observed (segment insertions, deletions, assimilations and weakenings) are consequences of these two kinds of variation in the gestural score.
19.3.1 Gestural overlap within and across tiers
When two gestures overlap in time, we expect to see different consequences (in actual articulator movement) depending on whether the two gestures are on the same or different articulatory tiers, that is, depending on the articulatory organization. Gestures on different tiers may overlap in time and yet proceed relatively independently of one another, without perturbing each other's trajectories, since they affect independent vocal tract variables. The possibility of such events on separate tiers "sliding" in time with respect to one another provides the basis for an analysis of the apparently diverse changes in (5). Across tiers. Example (5a) is described as an example of segment deletion. However, looking at this change in terms of the gestures involved, we hypothesize that the alveolar closure gesture for the / t / is still present in the fluent speech version, but that it has been completely overlapped, or "hidden," by the bilabial closure gesture. This means that the movement of the tongue tip towards the alveolar ridge and away again may occur entirely during the time that the lips are closed (or narrowed), so that there will be no local acoustic evidence of the alveolar closure gesture. Figure 19.12 shows the hypothesized variation in the symbolic gestural score for "must be." Only the oral subsystem is shown. In figure 19.12a, the alveolar closure precedes the bilabial closure. This implies that the gestural overlap is only partial. In figure 19.12b the gestures associated with " b e " have slid to the left so that the bilabial closure is effectively synchronous with the alveolar gesture. This view contrasts sharply with the more traditional description that there is a fluent speech rule that deletes the / t / in the appropriate environments. Under the latter hypothesis, one would not expect to find any articulatory movement associated with an alveolar closure. Articulatory evidence of such hidden closures is presented in the next section. 360
.
Figure 19.12 Hypothetical symbolic gestural scores (oral tiers only) for "must be". The symbol ʌ has its usual phonetic interpretation in this figure. (a) Canonical form (['mʌst#bi]), (b) fluent speech form (['mʌsbi]).
Example (5b) is described as an assimilation rather than a deletion. Nonetheless, the same kind of analysis can be proposed. The bilabial closure gesture may increase its overlap with the preceding alveolar gesture, rendering it effectively inaudible. The overlap of voicing onto the beginning of the bilabial closure yields the [bp] transcription. The possibility of analyzing assimilations in this way is also proposed in Kohler (1976) for German, and in Barry (1984). Brown (1977) also notes that it is misleading to view such changes as replacing one segment with another, but she does not propose a formal alternative. The combination of assimilation and deletion observed in (5c) can be analyzed in the same way. The bilabial closure gesture (associated with the /p/) increases its overlap with the alveolar closure gesture (associated with the /nd/). The fact that the velic lowering gesture for the / n / now overlaps the bilabial closure accounts for the appearance of [m]. Thus, these examples of consonant assimilation and consonant deletion are all hypothesized to occur as a result of increasing gestural overlap between gestures on separate oral tiers. 361
In fact, the most common types of place of articulation assimilations in casual speech do involve gestures on separate oral tiers. At least for RP, Brown (1977) claims that the most common cases involve alveolar stops assimilating to labials or velars (see also Gimson 1962). Thus, the common assimilation types represent two of the three possible combinations of gestures from two separate tiers. One might ask why labial-velar or velar-labial assimilations do not occur (at least not frequently), given the possibility of their overlapping. The answer to this question would involve studying the acoustic and perceptual consequences of overlapping vocal tract constriction movements (we intend to do this using the ability of our model to generate speech). A possible explanation lies in the fact (Kuehn and Moll 1976) that tongue tip movements show higher velocities than do either tongue dorsum or lip movements (which are about equivalent to each other). A slower movement might prove more difficult to hide. Within tiers. Gestures on the same articulatory tier cannot overlap without perturbing each other, since the same vocal tract variables are employed but with different targets. Thus, even partial overlap of gestures on the same tier leads to blending of the observed output characteristics of the two gestures. This same point has been made by Catford (1977), who distinguishes what he calls "contiguous" sequences (which typically involve the same tract variable in the present system), from " heterorganic " sequences (which typically involve different tract variables). The blending of gestures shows itself in spatial changes in one or both of the overlapping gestures. Variation in the overlap of gestures on the same tier can account for other types of assimilations (somewhat less common, according to Brown 1977) that cannot be accounted for in terms of the "hiding" notion proposed to account for the changes in (5). Some examples are shown in (6) (a,c from Catford 1977; b from Brown 1977): (6)
(a) /tɛn 'θɪŋz/ → [tɛn̪ 'θɪŋz] ("ten things")
(b) /kʌm frəm/ → [kʌɱ frəm] ("come from")
(c) /ðɪs 'ʃɒp/ → [ðɪʃ 'ʃɒp] ("this shop")
For example, in (6a), the overlapping production of the alveolar closure (associated with the /n/) and the dental fricative leads to a more fronted articulation of the alveolar closure (and perhaps more retracted articulation of the dental fricative). Similar interactions are found between a bilabial stop and labiodental fricative (in 6b), and between alveolar and palatoalveolar fricatives (in 6c). As Catford (1977) notes, the articulations involved (particularly in cases like 6c) show blending of the two goals into a smooth transition, rather than a substitution of one segment for another, as the transcription would indicate. Articulatory vs. functional structure. The examples of overlap and blending considered so far have all involved consonantal gestures. The consequences of gestures being on the same or different articulatory tiers are not restricted to 362
consonantal gestures, however. Similar consequences have also been observed between consonantal and vocalic gestures. Thus, while consonant and vowel gestures overlap in any CV utterance, only in the case of velar consonants does this overlap occur on a single articulatory tier (tongue body), and in this case we would expect the gestures to show blending. Indeed they do, and this blending is usually described as fronting of the velar consonant in the context of following front vowels. In contrast, alveolar stops can be produced concurrently with a vowel without changing the place of articulation of the stop. Ohman's (1967) X-rays of the vocal tract during the closures for /idi/, /udu/, and /ada/ show global shapes determined almost completely by the vowel, but with a relatively invariant tongue tip constriction superimposed on the vowel shapes. For /igi/ and /ugu/, however, the actual position of the constriction is shifted. Thus, articulatory organization (i.e. whether gestures occur on the same or different articulatory tiers) appears to be relevant to questions of blending and overlap, regardless of the functional status of the gestures. That is, it is relevant for both consonantal and vocalic gestures. However, the effect of articulatory structure can also interact with the functional status of the gestures involved. For example, Keating (1985) points out the effect that language-particular distinctiveness requirements can have on coarticulation. She discusses Ohman's (1966) findings that Russian consonants, which involve contrastive secondary tongue body articulations (palatalization vs. velarization), do block V-to-V tongue body coarticulation, whereas English tongue body consonants, which involve contrastive primary articulations, do not block V-to-V tongue body coarticulation. Following Ohman, she models this effect by placing the secondary tongue body articulation for the Russian consonants on the vowel tier. Thus, the functional status of the "consonantal" tongue body gestures in the two languages also affects the amount of blending observed.
19.3.1.1 EVIDENCE FOR HIDDEN GESTURES
If our analysis of the changes involved in examples like (5) is correct, then it should be possible to find articulatory evidence of the "hidden" alveolar gesture. We examined the AT&T X-ray database (described in section 2) for examples of consonantal assimilations and deletions of this kind, by listening to sentences with candidate consonant sequences. Although there were very few examples of assimilations or deletions in the corpus (talking with lead pellets in your mouth and an X-ray gun pointed at your head hardly counts as a "casual" situation), the examples we were able to find do show the hidden gesture. For example, figure 19.13a shows the vertical displacements of lead pellets placed on the velum, tongue dorsum rear, tongue blade, lower lip and lower teeth, along with the acoustic waveform, for the utterance "perfect memory," spoken as a sequence of two words separated by a pause. The phonetic transcription aligned with the acoustic 363
CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN
Audio waveform
h
f
*
P
e
h
k t
m
e
m
"" ill
Velum
-
Tongue rear Tongue blade
^_^—"
Lower lip
^
Jaw 40
20
60
—
80
^
100
_
^
120
Time (frames) (a)
Audio waveform
P
h
f
3v
,
i
k
m
1
' i ' "'irmmn'
e
m
ill
3
v
j
i
\\\\ I
Velum Tongue rear Tongue blade Lower lip
—^_^-^ ^
— - ^ ^ ^~——
Jaw 20
40
60
80
———. 100
120
Time (frames) (b) Figure 19.13 X-ray pellet trajectories for "perfect memory." (a) Spoken in a word list ([paifekt#'mem...]), (b) spoken in a phrase ([patfek'mem...]).
waveform indicates that the / t / at the end of "perfect" is completely audible and its release is visible in the waveform. The time-course of the velar closure gesture associated with the /k/ in "perfect" is assumed to be reflected in the vertical displacement of the tongue dorsum (tongue rear) pellet. The relevant portion of this pellet trajectory is underlined in the figure and is labeled with the appropriate gestural symbol. Similarly, the portion of the tongue blade displacement 364
Tiers in articulatory phonology Audio waveform
n
b m
as
o s t a d a
Velum Tongue rear Tongue blade Lower lip Jaw 20
40
60
80
100
120
Time (frames) Figure 19.14 X-ray pellet trajectories for "nabbed most" (['naebmost], spoken in a phrase.
associated with the alveolar closure gesture for the / t / in "perfect," and the portion of the lower lip displacement associated with the bilabial closure gesture for the initial / m / in "memory" have been marked and labeled in the figure. Note that the velar and alveolar gestures partially overlap, indicating that velar closure is not released until the alveolar closure is formed. Thus, the onset of the alveolar gesture is acoustically "hidden" (it takes place during the velar closure), but its release is audible. Note the large amount of time between the release of the alveolar gesture and the onset of the bilabial gesture. Figure 19.13b shows the same two word sequence spoken as part of a sentence. Here, the final / t / in "perfect" is deleted in the traditional sense - careful listening reveals no evidence of the / t / , and no / t / release can be seen in the waveform. However, the alveolar gesture can still be seen quite clearly in the figure. It is even of roughly of the same magnitude as in figure 19.13a. What differs here is that the bilabial gesture for the initial / m / now overlaps the release of the alveolar gesture. Thus, both the closure and release of the alveolar gesture are now overlapped and there is, therefore, no acoustic evidence of its presence. The existence of this kind of phenomenon is consistent with Fujimura's (1981) iceberg account of the X-ray corpus from which these tokens are taken. He proposed that certain articulatory movements (like those forming the onsets and offsets of the gestures in figure 19.13) remain relatively invariant across phrase positions in which a word may occur, but that they may "float" relative to other icebergs. The hiding we observe here is a consequence of that floating. Another example of alveolar stop "deletion" where the alveolar closure gesture remains can be seen in figure 19.14. The same pellet trajectories are shown as in the previous figure. Here, the speaker (the same one shown in the previous figure) 365
CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN
is producing the phrase "nabbed most" in a sentence. As indicated by the phonetic transcription, the / d / at the end of "nabbed" has been deleted. The bilabial gestures associated with the / b / of "nabbed" and the / m / of "most" here overlap (forming a continuous closure), and the alveolar closure gesture, while quite robust kinematically, is once again irrelevant acoustically. The X-ray data also provide an example of assimilation in which an alveolar closure gesture is hidden by bilabial gestures. Figure 19.15 shows x-ray pellet trajectories for a (second) speaker producing the phrase "seven plus seven times..." Pellet locations are the same as for the first speaker, except that there is no velum pellet. As indicated in the transcription, the version in figure 19.15b contains an assimilation of the final / n / to [m]. Note however, that the alveolar closure gesture (as outlined on the tongue blade trajectory) is still present in the assimilated version. Comparing the two figures, it is not completely clear how to account for their different acoustic properties. They do not show the radical timing differences shown in figure 19.13. The difference may simply reside in the alveolar gesture being somewhat reduced in magnitude in the assimilated version. As suggested earlier, reduction in gestural magnitude is the other major kind of change that we expect to observe in casual speech, and we will return to this aspect of change in section 19.3.2. The example does clearly show, however, the presence of a hidden gesture (though perhaps somewhat reduced) in an assimilation. While the amount of data we have analyzed is still quite small, these tokens do demonstrate that gestural overlap can lead to apparent assimilations and deletions. Further supporting data were found by Barry (1985) who analyzed assimilations in a manner similar to that proposed here. He presents electropalatographic evidence of "residual" alveolar articulations in cases of alveolar-velar and alveolar-bilabial assimilations. Some electropalatographic evidence of such "residual" alveolars is also presented by Hardcastle and Roach (1979) and for German by Kohler (1976). Barry also shows cases of assimilation that do not include these residual articulations. Such cases may involve reduction in the magnitude of the hidden gesture (as suggested for figure 19.15b), and it is possible that such reduced gestures would not show up in a technique that depends on actual contact. Alternatively, some cases of assimilation may involve complete gesture deletion. Even deletion, however, can be seen as an extreme reduction, and thus as an endpoint in a continuum of gestural reduction, leaving the underlying representation unchanged. Indirect evidence in support of the hidden gesture analysis of alveolar stop deletion can be found in variable rule studies of deletion of final /t,d/ in clusters (Labov, Cohen, Robins and Lewis 1968; Guy, 1975; Guy, 1980; Neu, 1980). These studies show that the deletion is more likely to take place if the following word begins with a consonant than if it begins with a vowel. In studies with enough data to permit analysis according to the initial segment of the following word (Guy, 1980; Neu, 1980), the greatest deletion probabilities occur when the 366
Tiers in articulatory phonology s
e
v
n
p
I A
e
v
tn
n
Audio waveform Tongue rear Tongue blade Lower lip Jaw 20
40
60
80
100
120
Time (frames) (a)
Audio waveform
v
se
r
n
p
l
A
s
ev
n
th
a
Tongue rear Tongue blade Lower lip Jaw 20
40
60
80
100
120
Time (frames) (b) Figure 19.15 X-ray pellet trajectories for "seven plus seven times." (a) Not assimilated ([sevn#plAs...]), (b) assimilated ([sevmplAs...]).
following word begins with (true) consonants, followed in order by liquids, glides and vowels. This consonant-to-vowel ordering is exactly what we would expect when we consider the consequences of gestural overlap. In general, given comparable overlap patterns, the more extreme the constriction associated with a gesture, the better able that gesture is to acoustically mask (or aerodynamically interfere with) 367
CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN
another gesture with which it is co-occurring (cf. Mattingly 1981). Thus, overlap by a following consonant (stop or fricative) gesture would be most able to contribute to hiding of an alveolar closure gesture, while a following vowel would presumably not contribute at all to hiding the gesture (indeed, as noted above, overlap of consonants with vowels is the usual case). In the case of a following vowel, the alveolar gesture itself must either be severely reduced or overlapped completely by a preceding consonant gesture. Thus, the ordering of probabilities on deletion of final /t,d/ in clusters could follow directly from the view of deletion that we are proposing here, without these differential probabilities needing to be "learned" as part of a rule. This is consistent with Guy's (1980) demonstration that this consonant-liquid-glide-vowel ordering is robust across dialects and across individual talkers, while the ordering of other factors that contribute to deletion probability (e.g. morphological status of the final /t,d/) may differ greatly from dialect to dialect.
19.3.1.2
THE EMERGENCE OF VARIATION IN GESTURAL ORGANIZATION
While the evidence summarized supports an overlapping gestural analysis of consonant assimilations and deletions, we would also like to understand how such variation in gestural scores arises. If we think of gestural scores as providing an organization of gestures that allows them to overlap, yet to all be perceived, why, and under what circumstances, should this organization "fail," in the sense that some gestures become imperceptible? Recall that for V C(C)(C) V utterances, we proposed that the oral consonant gestures are phased with respect to the immediately preceding one (statement 3), as long as the sequence conforms to the language's syllable structure constraints. The original syllable affiliation of the consonants does not matter — what is important is the fact of well-formedness of the sequence. For those cases we examined, (e.g., [s#pl] vs. [#spl]), the sequences conform to possible syllable onsets, even when there is an intervening word boundary (i.e. [s # pi]). The cases that lead to complete gestural overlap, however, involve sequences that do not form possible syllable onsets (or codas), i.e. alveolar closure-bilabial closure and alveolar closure-velar closure sequences. We propose that in these cases (for example the one shown in figure 19.12b), the phasing principles in (3), and possibly (2) and (4) as well, are not followed. While it is not yet clear exactly what kind of phasing will be involved in these cases, we expect that the structure will not involve phasing the consonants to one another in sequence. The lack of such sequential phasing would then allow the kind of overlap we saw in figures 19.13b, 19.14 and 19.15b to emerge. The view we are proposing then suggests that the gestural organization in a language is exquisitely tuned to allow the successive oral gestures in syllable onsets and codas to overlap partially, without obscuring the information in these gestures. 368
Tiers in articulatory phonology
This view has been propounded by Mattingly (1981), who argues that "the syllable has more than a phonological or prosodic role; it is the means by which phonetic influences [cf. gestures] are scheduled so as to maximize parallel transmission" (p. 418). As Mattingly suggests, this organization guarantees, for example, that the constriction and release of an [1] will not occur completely during the bilabial closure in a [pla] utterance. However, for sequences that are not possible syllable onsets (or codas), we hypothesize that the production system does not have the same kind of tight organization available. Thus, for such sequences, variation in degree of overlap is possible, even to the point of obscuring the parallel transmission of information (i.e. one gesture may hide another). Some indirect evidence supports the view that status as possible cluster correlates with tightness of gestural organization. Alveolar stops frequently assimilate in alveolar-labial and alveolar-velar sequences, but assimilation (in either direction) is rare in labial-alveolar and velar-alveolar sequences. This could simply be due to the relative scarcity of word-final labial and velars compared to final alveolars (see Gimson 1962). A more interesting interpretation, from the current perspective, attributes these asymmetries to syllable structure differences: the labial-alveolar and velar-alveolar sequences are all possible syllable codas, and thus would be expected to show a tight phasing organization that prevents complete overlap. However, the details of the actual articulatory movements in such cases need to be examined before our proposal can be explicitly worked out and evaluated. In addition, the possibility that postconsonantal final alveolars are not always part of the syllable core (e.g. Fujimura and Lovins 1978) needs to be addressed. 19.3.2
Other casual speech processes
The aspect of an articulatory phonology that makes it potentially powerful for describing continuous speech is that diverse types of phonetic alternation segment insertion, deletion, assimilation and weakening - can all be described in terms of changes in gesture magnitude or intergestural overlap. That is, these alternations, which might require a number of unrelated segmental rules, can be given a unified account in the gestural framework. In the previous subsection, we showed how variation in intergestural overlap can give rise to apparent consonant deletions and assimilations in casual speech. In this subsection, we suggest how variation in gestural overlap, and also gestural magnitude, might yield some other types of alternations. In addition to consonant deletions, schwa deletions are common in casual speech (Brown 1977; Dalby 1984). Just as with the apparent consonantal deletions described above, such apparent schwa deletions might result from variation in gestural overlap. For example, the apparent deletion of the second vowel in "difficult" (Shockey 1973) might instead be an increase in overlap between the labiodental fricative gesture and the following velar closure gesture, so that the 369
CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN
fricative is not released before the closure is formed (see Catford 1977, on open vs. close transitions). Apparent segment insertions might also arise from variation in gestural overlap, as the following examples suggest: (7)
(a) /'sAmGlr)/ (b) /'saemsan]
-> ['sAmpOlrj] -> ['saempsan]
("something") ("Samson")
A number of authors (e.g. Ohala 1974; Anderson 1976) have analyzed the epenthetic stop in such nasal-fricative sequences as arising from variation in the relative timing of the velic closure gesture and the preceding oral closure gesture. In particular, if denasalization precedes the release of the closure gesture, then a short interval of oral closure will be produced. These changes in timing could be directly accommodated within our gestural structures. Brown (1977) also identifies a class of weakenings, or lenitions, in casual speech. Typical examples involve stop consonants weakening to corresponding fricatives (or approximants), as shown in (8): (8)
(a) /bl'kAz/ (b) /'mAst bi/
-> [pxAz] > ['mAs(3i]
("because") ("must be")
These changes might be described as decreases in the magnitude of individual gestures. The reduction in amplitude of movement associated with the gesture then leads to the possibility of an incomplete acoustic closure. (Additionally, reductions in magnitude may combine with increased overlap, leading to greater likelihood of a gesture being "hidden" by another gesture: see figure 19.15.) Such reductions of movement amplitude often, but not always, occur in fast speech (Lindblom 1983; Munhall, Ostry, and Parush, 1985; Kuehn and Moll 1976; Gay 1981). It may also be the case that gestural magnitudes are reduced simply because the speaker is paying less attention to the utterance (Dressier and Wodak 1982; Barry 1984). Reduction of gestural magnitude may be involved in changes other than weakenings. For example, the classic cases of intervocalic voicing assimilation could be described as a reduction in the magnitude of the glottal opening-and-closing gesture responsible for the voicelessness. If the magnitude of the opening is reduced sufficiently, devoicing might not take place at all. Data from Japanese (Hirose, Niimi, Honda and Sawashima 1985) indicate that the separation between the vocal folds at the point where voicing ceases at the beginning of an intervocalic voiceless stop is much larger than at the point where voicing begins again at the end of the stop. This suggests that if the magnitude of the abduction gestures were slightly reduced, the critical value of vocal fold separation for devoicing might never be reached. This is likely what is going on the data of Lisker, Abramson, 370
Tiers in articulatory phonology
Cooper, and Schvey (1969). Using transillumination to measure glottal opening in running speech in English, they found that the vast majority (89%) of English /ptk/ (excluding cases following initial / s / and environments that allow flapping) were, in fact, produced with glottal opening, but of these, 11 % showed no break in glottal pulsing. While the magnitude of glottal opening was not measured, we would hypothesize that these cases involved a decrease in magnitude of the opening gesture.
19.4
Summary
We have discussed a computationally explicit representation of articulatory organization, one that provides an analytical, abstract description of articulation using dynamically defined articulatory gestures, arranged in gestural scores. We first showed how canonical phonological forms can be described in the gestural framework, presenting preliminary results that syllable structure is best represented using separate vowel and consonant tiers, such that consonantal gestures overlap the vowel gesture with which they are associated. We also suggested that the vocalic gestures are most closely associated with the leftmost consonantal gesture in a consonant sequence, and that well-formedness of a consonant cluster is revealed by lack of variability in the overlap among the gestures constituting the cluster. We then showed how it might be possible to describe all reported phonological changes occurring in casual speech as consequences of variation in the overlap and magnitude of gestures. That is, in the gestural approach these two processes are seen as underlying the various superficially different observed changes. We presented some examples of "hidden" gestures, in which the articulations associated with the gesture were still observable in fluent speech, although there were no perceptible acoustic consequences of the gesture. We further discussed the importance of articulatory structure in fluent speech: overlap of gestures has very different consequences, depending on whether the gestures are on the same or different articulatory tiers. The gestural approach to casual speech is highly constrained in that casual speech processes may not introduce units (gestures), or alter units except by reducing their magnitude. This means that all the gestures in the surface realization of an item are present in their lexical representation; casual speech processes serve only to modify these gestures, in terms of diminution or deletion of the gestures themselves, or in terms of changes in overlap. Phonological rules of the usual sort, on the other hand, can introduce arbitrary segments, and can change segments in arbitrary ways. The gestural approach is further constrained by its reliance on articulatory structure, by its use of task dynamics to characterize the movements, and by our insistence on computational explicitness. All of these constraints lead to directions for future research. Will our suggested structures, both for the canonical and
CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN
casual speech forms, be confirmed by further articulatory studies ? Can articulatory structure provide simpler solutions to phonological problems such as the specification of language-particular syllable templates ? Do dynamic parameters, that in our system are attributes of gestures, participate in phonological patterns that are inherently different from those patterns involving articulatory structure ? Such questions are but part of the challenge for an articulatory phonology.
Appendix A Mass-spring model A simple dynamical system consists of a mass attached to the end of a spring. If the mass is pulled, stretching the spring beyond its rest length (equilibrium position), and then released, the system will begin to oscillate. The resultant movement patterns of the mass will be a damped sinusoid described by the solution to the following equation:
wherem b k x0 x x x
mx + bx + k(x — x0) = 0 = mass of the object = damping of the system = stiffness of the spring = rest length of the spring (equilibrium position) = instantaneous displacement of the object = instantaneous velocity of the object = instantaneous acceleration of the object
Note that the time-varying motion of the sinusoid is described by an equation whose parameters do not change over time. The equation constitutes a global constraint on the form of the movement; different trajectory shapes can be obtained by substituting different values for the parameters m, k, and x0. When such an equation is used to model the movements of coordinated sets of articulators, the "object" - motion variable - in the equation is considered to be the task, for example, the task of lip aperture. Thus, the sinusoidal trajectory would describe how lip aperture changes over time. In task dynamics, the task is treated as massless, since it is the motion of an abstract entity, the tract variable, that is being modelled, rather than the movement of physically massive articulators. For further details on task dynamics, see Saltzman (1986).
Appendix B Phonetic identity and stress Phonetic identity is conveyed primarily by the values of the equilibrium positions for the tract variables. The values of the equilibrium positions (targets) for the tract variables LA, TBCD, TTCD, GLO, and VEL refer to constriction degree, while the targets for LP, TBCL, and TTCL refer to the location of the constriction with respect to the upper or back wall of the vocal tract (LP refers to lip protrusion). While we have available to us the complete numerical continuum, our initial modeling relies on categorical approximations. Thus, for the oral constriction degree tract variables, there are a maximum of 7 discrete values: closure, " critical" (that constriction appropriate for generating frication), narrow, narrow-mid, mid, mid-wide, and wide. The tongue constriction location variables also have a maximum of 7 discrete values: dental, alveolar, alveo-palatal, palatal, velar, uvular, and 372
Tiers in articulatory phonology
pharyngeal. In the current preliminary (and in complete) formulation, the first three locations utilize TTCL and the last four TBCL. The glottal and velic tract variables are currently not explicitly modeled by task dynamics; instead the acoustic consequences of the articulatory movements are approximated by simple on-off functions. The glottal tract variable has three modes: one approximating an opening-and-closing gesture (for voicelessness), one approximating a (tight) closing-and-opening gesture (for glottal stops), and one for speech mode (voicing); the velic tract variable is binary valued, i.e. either open or closed. The durational and segmental variations associated with differences in stress level can be simply accounted for in terms of dynamically defined articulatory movements (Browman and Goldstein 1985; Kelso, Vatikiotis-Bateson, Saltzman, and Kay 1985; Ostry and Munhall 1985). Decreasing the stiffness k of the "spring" for the stressed vowel results in a slower trajectory, which corresponds to the longer duration associated with stress. Increasing the difference between the rest length (equilibrium position) of the spring (x0) and the initial position (x) increases the amplitude of the oscillation, which corresponds to the difference in displacement between a reduced and full vowel. For a consonant, say a labial closure gesture, either decreasing the stiffness or increasing the target (equilibrium position) will increase the duration of the acoustic closure, since in both cases there will be a longer period of time during which the lip aperture will be small enough to achieve acoustic closure. In our current implementation, different stress levels indicate different multiplicative factors used to modify the inherent dynamic parameters. While this approach appears to work well for the gestural scores we have tested so far, at least for stiffness, it needs to be tested in considerably more detail, including being compared to other possible implementations, such as additivity. Note Our thanks to Ignatius Mattingly, Eric Vatikiotis-Bateson, Kevin Munhall, Carol Fowler, Mary Beckman, John Kingston, and Doug Whalen for thoughtful comments on early drafts of this paper; to Yvonne Manning and Caroline Smith for help with the figures; and to Caroline Smith and Mark Tiede for facilitating the analysis of the X-ray data. This work was supported in part by NSF grant BNS 8520709 and NIH grants HD-01994, NS-13870, NS-13617 to Haskins Laboratories. References Anderson, S. R. 1976. Nasal consonants and the internal structure of segments. Language 52: 326-344. Barry, M. C. 1984. Connected speech: processes, motivations, models. Cambridge Papers in Phonetics and Experimental Linguistics 3: 1-16. 1985. A palatographic study of connected speech processes. Cambridge Papers in Phonetics and Experimental Linguistics 4: 1-16.
Borden, G. J. and T. Gay. 1979. Temporal aspects of articulatory movements for / s / stop clusters. Phonetica 36: 21-31. Browman, C. P. and L. Goldstein. 1985. Dynamic modeling of phonetic structure. In V. Fromkin (ed.) Phonetic linguistics. New York: Academic Press, 35-53. 1986. Towards an articulatory phonology. Phonology Yearbook 3: 219-252. Browman, C. P., L. Goldstein, J. A. S. Kelso, P. Rubin, and E. Saltzman. 1984. Articulatory synthesis from underlying dynamics. Journal of the Acoustical Society of America 75: S22-S23. (Abstract). 373
CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN
Browman, C. P., L. Goldstein, E. Saltzman, and C. Smith. 1986. GEST: A computational model for speech production using dynamically defined articulatory gestures. Journal of the Acoustical Society of America 80: S97. (Abstract). Brown, G. 1977. Listening to Spoken English. London: Longman. Catford, J. C. 1977. Fundamental Problems in Phonetics. Bloomington, IN: Indiana University Press. Clements, G. N. 1980. Vowel Harmony in Nonlinear Generative Phonology: An Autosegmental Model. Bloomington, IN: Indiana University Linguistics Club. 1985. The geometry of phonological features. Phonology Yearbook 2: 225-252. Clements, G. N. and S. J. Keyser. 1983. CV Phonology: A Generative Theory of the Syllable. Cambridge, MA: MIT Press. Dalby, J. M. 1984. Phonetic structure of fast speech in American English. Ph.D. dissertation, Indiana University. Dressier, W., and R. Wodak. 1982. Sociophonological methods in the study of sociolinguistic variation in Viennese German. Language and Society 11: 339-370. Fowler, C. A. 1980. Coarticulation and theories of extrinsic timing. Journal of Phonetics 8: 113-133. 1981. A relationship between coarticulation and compensatory shortening. Phonetica 38: 35-50. 1983. Converging sources of evidence on spoken and perceived rhythms of speech: Cyclic production of vowels in sequences of monosyllabic stress feet. Journal of Experimental Psychology: General 112: 386—412.
Fujimura, O. 1981. Temporal organization of articulatory movements as a multidimensional phrasal structure. Phonetica 38: 66-83. Fujimura, O., S. Kiritani, and H. Ishida. 1973. Computer controlled radiography for observation of movements of articulatory and other human organs. Computers in Biology and Medicine 3: 371-384. Fujimura, O. and J. Lovins. 1978. Syllables as concatenative phonetic units. In A. Bell and J. B. Hooper (eds.) Syllables and Segments. Amsterdam: North Holland, 107-120. Gay, T. 1977. Articulatory movements in VCV sequences. Journal of the Acoustical Society of America 62: 183-193. 1978. Articulatory units: segments or syllables? In A. Bell and J. B. Hopper (eds.) Syllables and Segments. Amsterdam: North Holland, 121-131. 1981. Mechanisms in the control of speech rate. Phonetica 38: 148-158. Gimson, A. C. 1962. An Introduction to the Pronunciation of English. London: Edward Arnold. Goldsmith, J. A. 1976. Autosegmental Phonology. Bloomington, IN: Indiana University Linguistics Club. Goldstein, L. and C. P. Browman. 1986. Representation of voicing contrasts using articulatory gestures. Journal of Phonetics 14: 339-342. Guy, G. R. 1975. Use and application of the Cedergren-Sankoff variable rule program. In R. Fasold and R. Shuy (eds.) Analyzing Variation in Language. Washington, DC: Georgetown University Press, 56-69. 1980. Variation in the group and the individual: the case of final stop deletion. In W. Labov (ed.) Locating Language in Time and Space. New York: Academic Press, 1-36. Halle, M. 1982. On distinctive features and their articulatory implementation. Natural Language and Linguistic Theory 1: 91—105.
374
Tiers in articulatory phonology
Hardcastle, W. J., and P. J. Roach. 1979. An instrumental investigation of coarticulation in stop consonant sequences. In P. Hollien and H. Hollien (eds.) Current Issues in The Phonetic Sciences. Amsterdam: John Benjamins AG, 531-540. Hayes, B. 1981. A Metrical Theory of Stress Rules. Bloomington, IN: Indiana University Linguistics Club. Hirose, H., S. Niimi, K. Honda, and M. Sawashima. 1985. The relationship between glottal opening and transglottal pressure difference during consonant production. Annual Bulletin of RILP 19, 55-64. Kahn, D. 1976. Syllable-based Generalizations in English Phonology. Bloomington, IN: Indiana University Linguistics Club. Keating, P. A. 1985. CV phonology, experimental phonetics, and coarticulation. UCLA WPP 62: 1-13. Kelso, J. A. S. and B. Tuller. 1985. Intrinsic time in speech production: theory, methodology, and preliminary observations. Haskins Laboratories. Status Report on Speech Research SR-81: 23-39. Kelso, J. A. S., E. Vatikiotis-Bateson, E. L. Saltzman, and B. Kay. 1985. A qualitative dynamic analysis of reiterant speech production: phase portraits, kinematics, and dynamic modeling. JfASA 11: 266-280. Kent, R. D., and K. Moll. 1975. Articulatory timing in selected consonant sequences. Brain and Language 2: 304—323.
Kohler, K. 1976. Die Instability t wortfinaler Alveolarplosive im Deutschen: eine elektropalatographische Untersuchung [The instability of word-final alveolar plosives in German: An electropalatographic investigation.] Phonetica 33: 1-30. Kuehn, D. P. and K. Moll. 1976. A cineradiographic study of VC and CV articulatory velocities. Journal of Phonetics 4: 303-320. Labov, W., P. Cohen, C. Robins, and J. Lewis. 1968. A study of the non-standard English of Negro and Puerto Rican speakers in New York City. Cooperative Research Report 3288. Vols. I and II. New York: Columbia University. (Reprinted by U.S. Regional Survey, 204 North 35th St., Philadelphia, PA 19104) Ladefoged, P., and I. Maddieson, I. 1986. Some of the sounds of the world's languages. UCLA Working Papers in Phonetics 64: 1-137. Liberman, A., F. Cooper, D. Shankweiler, and M. Studdert-Kennedy. 1967. Perception of the speech code. Psychological Review 74: 431-436. Lindblom, B. 1963. Spectrographic study of vowel reduction. Journal of the Acoustical Society of America 35: 1773-1781. 1983. Economy of speech gestures. In P. F. MacNeilage (ed.) The Production of Speech. New York: Springer-Verlag, 217-245. Lisker, L., A. S. Abramson, F. S. Cooper, and M. H. Schvey. 1969. Transillumination of the larynx in running speech. Journal of the Acoustical Society of America 45: 1544-1546. Macchi, M. J. 1985. Segmental and suprasegmental features and lip and jaw articulators. Ph.D. dissertation, New York University. Mattingly, I. G. 1981. Phonetic representation and speech synthesis by rule. In T. Myers, J. Laver, and J. Anderson (eds.) The Cognitive Representation of Speech. Amsterdam: North-Holland. 415—420. McCarthy, J. J. 1981. A prosodic theory of nonconcatenative morphology. Linguistic Inquiry 12: 373^18. Miller, J. E., and O. Fujimura. 1982. Graphic displays of combined presentations of acoustic and articulatory information. The Bell System Technical Journal 61: 799-810. 375
CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN
Munhall, K. G., D. J. Ostry, and A. Parush. 1985. Characteristics of velocity profiles of speech movements. Journal of Experimental Psychology: Human Perception and Performance 11: 457^74. Neu, H. 1980. Ranking of constraints on /t,d/ deletion in American English: A statistical analysis. In W. Labov (ed.) Locating Language in Time and Space. New York: Academic Press, 37-54. Ohala, J. J. 1974. Experimental historical phonology. In J. M. Anderson and C. Jones (eds.) Historical Linguistics Vol. 2. Amsterdam: North-Holland. 353-389. Ohman, S. E. G. 1966. Coarticulation in VCV utterances: spectrographic measurements. Journal of the Acoustical Society of America 39: 151-168. 1967. Numerical model of coarticulation. Journal of the Acoustical Society of America 41: 310-320. Oshika, B., V. Zue, R. Weeks, H. Neu, and J. Aurbach. 1975. The role of phonological rules in speech understanding research. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-23, 104-112. Ostry, D. J. and K. Munhall. 1985. Control of rate and duration of speech movements. Journal of the Acoustical Society of America 11: 640-648. Rubin, P., T. Baer, and P. Mermelstein 1981. An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America 70: 321-328. Saltzman, E. 1986. Task dynamic coordination of the speech articulators: a preliminary model. In H. Heuer and C. Fromm (eds.) Generation and Modulation of Action Patterns. Experimental Brain Research Series 15. New York: Springer-Verlag, 129-144. Saltzman, E. and J. A. S. Kelso. 1987. Skilled actions: a task dynamic approach. Psychological Review 94: 84-106. Selkirk, E. O. 1984. Phonology and Syntax: The Relation between Sound and Structure. Cambridge, MA: MIT Press. Shockey, L. 1984. Phonetic and phonological properties of connected speech. Ohio State Working Papers in Linguistics. 17: iv—143. Thrainsson, H. 1978. On the phonology of Icelandic preaspiration. Nordic Journal of Linguistics. 1(1): 3-54. Vennemann, T., and P. Ladefoged. 1973. Phonetic features and phonological features. Lingua 32: 61-74. Zwicky, A. 1972. Note on a phonological hierarchy in English. In R. Stockwell and R. Macauly (eds.) Linguistic Change and Generative Theory. Bloomington: Indiana University Press.
376
20
Toward a model of articulatory control: comments on Browman and Goldstein's paper OSAMU FUJIMURA
Browman and Goldstein provide an outline of a new quantitative model of the temporal organization of speech that deviates basically from the classical model of concatenation and coarticulation (see Fujimura 1987 for some relevant discussion of this issue). Admittedly, some of the current specifications of their model are vague, and further work is much needed. Also, I am not quite convinced that the model they propose will work, after all. My own approach is rather different (Fujimura 1987). Nevertheless, I welcome this proposal with enthusiasm and excitement. I share their sense of importance of the goal of establishing a quantitative model of temporal organization, and whoever's idea may turn out to be correct, it will be great to see some solution to this problem of our strong concern. The problem is very difficult to solve, but it has to be solved to be able to understand the essence of speech organization, to arrive at an adequate framework of phonological description, and also to apply such basic linguistic knowledge for true breakthroughs in speech technology. My first comment is that it is not clear to me what task dynamics actually contributes to the description proposed here for relating articulatory gestures to each other. I can see that the current specifications of the model seem consistent with the idea of task dynamics, but I still have to be convinced that further specifications which will need to be given to make the model adequate for describing the speech organization we observe, do not cause any theory-internal inconsistency. The local interpretation of time functions representing observed movement patterns can be achieved to a large extent by a piecewise approximation using other elementary mathematical time functions (see for examples Kent 1986: 238; Vaissiere, 1988), as well as by critically damped oscillatory functions. If I understand the paper correctly, Browman and Goldstein are not particularly concerned with the local accuracy of the curve fitting, but are more concerned with the validity of their general principle. In order to assert that the task dynamics approach is the correct one, we will have to examine a wider range of phenomena with respect to overall temporal organization of speech. In order to appreciate the 377
OSAMU FUJIMURA
consequence of assuming task dynamics as the biological principle underlying speech organization, therefore, we will need to ask how gesture specifications are generated out of prosodic as well as "segmental" representations at the level of abstract phonological description of speech. The substance of this question is: what determines the triggering time and initial conditions of each elementary event so as to relate different articulatory actions to each other in a time series ? Task dynamics prescribes that any relative timing be specified in terms of the phase angle of each underlying (undamped) oscillation, and Browman and Goldstein do attempt to follow this prescription. In order to specify an event that actually occurs according to the physical laws, however, the initial position and velocity (or some other boundary conditions) must be given in addition to the differential equation, but this aspect of the physics is not described in their presentation explicitly. They do address the issue of timing for individual events, by giving general rules for relative phases of contiguous gestural events. As far as I can see, however, the phase-angle specification based on task dynamics is equivalent to specifying event triggers to occur according to the classical notion of segmental duration assignment and time overlapping, except that the concept of segment is generalized to include transitional segments (diphones or, in my opinion more correctly, demisyllables). Of course, I agree with this generalization of the concept of segment. The use of inherent time constants for phonetic elements including (underlying) transitions is not novel; we have seen the same concept in classical speech synthesis rule systems as well as in the tradition of work at Haskins Laboratories (see for example the Fl cutback theory). However, it remains to be seen what generalization we obtain by using the specific form of phase-angle specifications and, presumably, inheritance of position but not velocity (it is not clear what initial velocity is given) from the previous oscillatory function at the point of triggering of the upcoming event. Another point I would like to make concerns Browman and Goldstein's interpretation of dynamic characteristics of movement patterns in terms of "stiffness." This mysterious terminology stems from their assumption of unit mass for the oscillatory system. Any faster movement is described as the result of the articulator's being stiffer. In many cases, however, a more realistic interpretation might be that the effective mass of the articulator is smaller. Normalizing the mass is appropriate only when the mass never changes within the domain of described phenomena. But this is not the primary issue here. Whatever the interpretation of the underlying principle, if it correctly describes the observed phenomena then it is a useful model, and we will have learned a great deal. And if the model is consistent with a more basic hypothesis about general principles of biology or physics, it is all the more interesting, and it can then be considered an explanation. Whether the model proposed here has descriptive or explanatory adequacy, however, remains to be seen. 378
Toward a model of articulatory control
A third general comment I have is about the articulatory model. It should be noted that there are so-called articulatory models which are efficient computational schemes for deriving acoustic signals out of vocal tract area functions. But such a model is not useful for the purpose of relating articulatory gestures to the speech signals unless it is combined with some true articulatory model. The vocal tract is strictly an acoustical (as opposed to articulatory) concept. In order to relate control gestures that play substantive roles in phonological representation to the variables and parameters that describe the articulatory states of component organs, particularly with respect to the temporal characteristics, we do need to compute directly the dynamic states of individual articulators. The vocal tract configuration must be derived by an articulatory model from these articulatory states. The model developed at Haskins Laboratories, based on the classical cylindrical model of the tongue (Coker and Fujimura 1966) could aim at a dynamic articulatory model in this sense (and so could Coker 1968). As discussed elsewhere (Fujimura and Kakita 1979; Kakita and Fujimura 1984), however, there are some basic and inherently nonlinear characteristics of articulatory control that can be captured only by a three-dimensional model (or its approximation based on recognition of such effects). This is not a critique of the current work, which is a good starting compromise, but an expression of hope for future innovations. In this connection, we should note that it is not the location of constriction along the vocal tract that matters directly in articulator-based phonological description, but the selection of an articulator, or an agent. This distinction is important in identifying controlled gestures as distinct from resultant physical gestures. For example, even velum lowering as related to the feature nasal, which is probably the articulatory correlate most independent of other articulatory controls, does show a concomitant narrowing in the oral vocal tract (see Maeda 1983 for a computational model considering this effect and a discussion of its perceptual relevance). Furthermore, a direct interaction of velum height and tongue dorsum height is observed during velar occlusion in our microbeam data. When the mechanical coupling between independent articulators is much stronger and still quantitatively less understood, as in the case of lip/mandible and tongue-tip/mandible interaction, the separation of control and physical state is more complex and difficult to interprete by mere conjecture. This is why an effective three-dimensional and anatomically correct model is badly needed. Motor equivalence of a goal-oriented coordinated structure is an attractive concept, and no doubt it captures some truth about the capability of biological systems for explaining observations in certain situations, but we should not overgeneralize our understanding of phenomena under limited circumstances to control principles of general speech behavior effecting phonological contrasts (Fujimura 1987). There are empirical observations suggesting differential uses of independent articulatory controls, such as Macchi's (1985) separation of lip vs. 379
OSAMU FUJIMURA
mandible controls, corresponding to segmental vs. suprasegmental categories of linguistic functions. The possibility of such differential uses of articulators leads to a final question I have regarding phonological structure in relation to articulatory gestures: what is the role of phonological word or phrase boundaries ? Browman and Goldstein apparently assume that consonant clusters are broken up based on the syllable structure constraints, regardless of word boundaries. This is not always so (see Davidsen-Nielsen 1974; see also Fujimura and Lovins 1977 for discussion). Also, we need to discuss the effects of phrase boundaries even just to describe words uttered in isolation, since utterances are normally produced according to some phrasing pattern, and the phrasal effects are quite large. My feeling is that in order to discuss temporal organization issues using real speech data, we need to have some model of the effects of phrasal and prosodic modulations (Fujimura 1987). My final comment: Browman and Goldstein's discussion of apparently ad hoc segmental alterations in casual speech is eloquent, and I hope phonologists will take this seriously as an incentive for considering articulatory data. In this connection, I have mentioned some additional phenomena in my discussion of the iceberg model (Fujimura 1981, 1986). Some allophonic reduction phenomena, such as t-flapping in American English, may be explained by a timing shift but not merely by closure duration manipulation. Also I suggested a nontraditional principle of temporal repulsion in some sequences of gestures using the same articulator. These issues remain to be studied with a systematic collection of actual articulatory data. We should not hesitate to applaud Browman and Goldstein in offering to us the first explicit model of multidimensional articulatory organization - t h e "score" that could replace the classical model of segmental concatenation and smoothing. While what they have specified could, in my opinion, still be reinterpreted without assuming task dynamics, it is important to have any explicit formalization of a new idea. Also, the quantitative treatment of relative timing among articulators no doubt will lead us to a new insight into the inherently multidimensional nature of the articulatory system. Such empirically motivated insight is invaluable, when we consider a reformulation of the abstract representation scheme of phonological structure.
References Coker, C. H. 1968. Speech synthesis with a parametric articulatory model. In Speech Symposium, Kyoto 1968. Reprinted in J. L. Flanagan and L. R. Rabiner (eds.) Speech Synthesis. Stroudsberg, Pennsylvania: Dowden-Hutchinson & Ross, 135-139. Coker, C. H. and O. Fujimura. 1966. Model for specification of the vocal tract area function. Journal of the Acoustical Society of America 40: 1271. (abstract) Davidsen-Nielsen, N. 1974. Syllabification in English words with medial sp, st, sk. Journal of Phonetics 2: 15-45. 380
Toward a model of articulatory control Fujimura, O. 1981. Temporal organization of articulatory movements as a multidimensional phrasal structure. Phonetica 38: 66-83. [Corrected version in A. S. House (ed.) Proceedings on the Symposium on Acoustic Phonetics and Speech Modeling, Part 2, Paper 5. Institute of Defense Analysis, Princeton, New Jersey.] 1986. Relative invariance of articulatory movements: an iceberg model. In J. S. Perkell and D. H. Klatt (eds.) Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence Erlbaum Associates, 226-234. 1987. Fundamentals and applications in speech production research. In Proceedings of the 11th International Congress of Phonetic Sciences. Tallinn, Vol. 6, 10—27. Fujimura, O. and Y. Kakita. 1979. Remarks on quantitative description of the lingual articulation. In S. Ohman and B. Lindblom (eds.) Frontiers of Speech Communication Research. London: Academic Press, 17-24. Fujimura, O. and J. B. Lovins 1978. Syllables as Concatenative Phonetic Units. In A. Bell and J. B. Hooper (eds.), Syllables and Segments. Amsterdam: North-Holland, 107-120. [Full version distributed as a monograph by the Indiana University Linguistics Club, 1982.] Kakita, Y. and O. Fujimura. 1984. Computation of mapping from muscular contraction patterns to formant patterns in vowel space. In V. A. Fromkin (ed.) Phonetic Linguists. Orlando, FL. and London: Academic Press, 133-134. Kent, R. D. 1986. The iceberg hypothesis: the temporal assembly of speech movements. In J. S. Perkell and D. H. (eds.) Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence Erlbaum, 234-242. Macchi, M. J. 1985. Segmental and suprasegmental features and lip and jaw articulators. Ph.D. dissertation, New York University. Maeda, S. 1983. Correlats Acoustiques de la Nasalization des Voyelles: Une Etude de Simulation, Centre National D'etudes des Telecommunications, Centre des Recherches de Lannion, Rept. VII. Vaissiere, J. 1988. Prediction of velum movement from phonological specifications. Phonetica 45: 122-139.
21 Gestures and autosegments: comments on Browman and Goldstein's paper DONCA STERIADE
21.1
Introduction
The primary goal of Browman and Goldstein's study appears to be that of modeling speech at a level of detail far greater than that to which a phonologist ordinarily aspires. For this reason, phonologists might be tempted to consider their gestural framework not as an alternative to the standard autosegmental model (the model presented in Goldsmith 1976; Clements 1985; and others) but rather as a way of beefing up the autosegmental representations so that they can begin to deal with certain subphonemic aspects of articulatory timing. In these comments, however, I will assume, that Browman and Goldstein are presenting a distinct theoretical alternative and that they advocate the adoption of gestural representations to the exclusion of autosegmental representations, not just as an interpretive appendix to them. I will begin by outlining the formal differences between gestures and autosegments. I will explain then why the distinct properties of the theory of autosegmental timing remain more useful in phonological analyses. In the last section of my comment, I will try to show that the points on which Browman and Goldstein's representations diverge from standard autosegmental theory have direct application to the description of certain phonological phenomena which have so far remained unexplained.
21.2
Gestures and autosegments
To understand what differentiates the gestural and autosegmental models we must look for answers to two questions: what types of units are the phonological representations made of and what types of relations obtain between these units ? 21.2.1 Units Let us consider the first question. Browman and Goldstein's gestures, the units of the gestural framework, correspond to a subset of the units countenanced in 382
Gestures and autosegments
autosegmental phonology: for instance, a bilabial closing gesture ((3 in figure 19.5a) corresponds to one instance of the articulator node Labial (Sagey 1986); a velic opening gesture (+ \\) corresponds to a [ + nasal] specification; a glottal opening and closing gesture (y) corresponds to one [ + spread glottis] value. There are some differences too. Gestures include not only information about articulators and the location of constriction but also about the manner in which the constriction is made: e.g. whether it is a near-closing gesture (as in the case of fricatives: cf. a in figure 19.5a) or a complete closure. The gestural model does not permit continuancy to be viewed as a distinct object from point of articulation. Although other manner features are not mentioned, I assume that they would also be viewed as integral parts of the gesture, rather than as distinct objects. In contrast, autosegmental models, specifically those of Clements (1985) and Sagey (1986), keep apart point of articulation autosegments from manner specifications. We do not know enough about the phonological patterning of manner features to evaluate this point of difference between the two models. But two observations can be made which indicate that a synthesis of these two approaches is necessary. Manner specifications are seldom if ever subject to assimilation rules, a fact which appears to support Browman and Goldstein's decision not to recognize them as distinct objects. On the other hand, as Sagey (1986) notes, assimilations in point of articulation do not necessarily propagate continuancy or any other manner features. Assimilations are analyzed gesturally as changes in the relative timing of two gestures: from sequentially ordered to overlapping or simultaneous. When two gestures become simultaneous, one may be perceptually "hidden" by the other: because it goes unnoticed, the hidden gesture may eventually disappear. Since the ultimate outcome of assimilation is the replacement of one gesture by the other, the gestural model predicts, contrary to fact, that all assimilations will involve both the replacement of point of articulation values and that of manner values. A revision may be necessary here. Aside from this point, the choice of units in the gestural model closely parallels that of articulator nodes and terminal features in the autosegmental models of feature organization of Clements (1985) and Sagey (1986). 21.2.2
Relations between tiers: direct vs. indirect timing
The fundamental difference between the two models emerges when we consider their analysis of the timing relations of precedence, simultaneity and temporal overlap. Gestures are elements with internal duration, a property represented abstractly as an internal 360 degree cycle. The duration allows one to represent directly temporal overlap between two gestures, by specifying the subsequence within each one that is synchronized with a subsequence in the other: for this reason I will refer to the gestural theory as a theory of direct timing. An illustration 383
DONCA STERIADE
of the gestural analysis of articulatory overlap appears in figure 19.4d of Browman and Goldstein's article. In contrast, autosegments are generally viewed as only points in time where no beginning and endpoints can be distinguished. Between two points one can define precedence and simultaneity but not partial overlap. For this reason and because the distinction between simultaneous and partially overlapping units is highly significant phonologically, standard autosegmental representations must represent articulatory timing as a relation mediated by one and possibly more auxiliary tiers, which play the role of external clocks. A theory of indirect articulatory timing is then central to the autosegmental framework. The auxiliary tiers mentioned above are the timing tier (the CV tier of McCarthy 1979; Clements and Keyser 1983; and the alternatives discussed in McCarthy and Prince 1986) and the class node tiers introduced in Clements (1985), Mohanan (1983), Sagey (1986). I illustrate in (1) the role of auxiliary tiers in defining partial overlap between autosegments. The sequence described is a homorganic nasal-stop cluster / m b / : the task is to represent the fact that a single autosegment, the articulator node Labial, stands in partial overlap with the specifications [ +nasal] and [ — nasal]. For present purposes we need to refer to only one auxiliary tier here: this is the timing tier, seen below in its CV-theory version. (1)
Labial C
C
[ + nas] [-nas]
What is represented in (1) is the overlap between labiality and nasality in a cluster of two consonants, a long sequence. Overlap between the same units can also be represented autosegmentally when it characterizes a short sequence, in this case a single prenasalized consonant: (2)
[ + nas] [ — nas]
21.2.3
Phonological length and indirect timing
Although auxiliary tiers are made necessary by the need to distinguish overlap from simultaneity between units lacking duration, their uses extend far beyond this aspect of autosegmental representations. The introduction of just one auxiliary tier permits one to represent as distinct three degrees of phonological length: long, short and extra-short. The Labial autosegment is long in (1), in that it is linked to more than one C. The same autosegment is short in (2), in that it occupies exactly one C. The nasal autosegments are short in (1), but extra-short 384
Gestures and autosegments
in (2), where they share a single C. These three degrees of phonological length turn out to be necessary in phonological descriptions. The long vs. short distinction is necessary in the representation of geminates or partially assimilated clusters such as / m b / : on this point there is an extensive literature (Leben 1980; Schein 1981; Kenstowicz 1982; Prince 1984; Hayes 1986; Schein and Steriade 1986) demonstrating that phonetically long segments must be phonologically represented as single autosegments multiply linked to two or more positions on the timing tier. Most of the references cited explicitly reject the alternative of representing long segments as sequences of adjacent, identical elements, i.e. in terms of what Browman and Goldstein (1986: 235) refer to as distinct, overlapping gestures on the same tier. The short vs. extra-short distinction is necessary in the description of contour segments2: units patterning phonologically as single segments but displaying a sequence of distinct specifications for a given feature, such as [nasal] or [continuant]. Prenasalized consonants, such as the /mb/ sequence in (2), and affricates are commonly cited examples of contour segments.3 Other instances of extra-short segments are the Finnish short diphthongs discussed by Keyser and Kiparsky (1984) and the intrusive stops analyzed by Clements (1987). Important in assessing the relative merits of the autosegmental and gestural models of articulatory timing is the fact that the distinction between long, short and extra-short autosegments follows without stipulation from the autosegmental model: a mapping between one autosegment and more than one timing unit corresponds to a long segment; a one-to-one mapping corresponds to a short segment; a mapping between several autosegments and one timing unit corresponds to an extra short segment. The autosegmental model does not use distinct primitives in describing the three possibilities: it merely combines the same units in all possible ways. In contrast, the gestural model of timing appears handicapped when one considers how it can be extended to account for existing segmental quantity distinctions. Browman and Goldstein have not yet considered this issue explicitly and should not be held responsible for my reconstruction of their possible moves on this matter. I see two options. One is to distinguish gestures in terms of their duration: to allow for every gesture a long cycle, a short cycle and an extra-short cycle. This will introduce into the gestural model the types of quantity distinctions that appear to be linguistically significant. While this is a possible move, it is not a desirable one because it leaves unexplained why it is precisely these three length values that must be introduced: why not two or thirteen degrees of length ? This question does not arise in the autosegmental framework, where the existing three distinctions correspond to the only conceivable mapping modes between timing slots and autosegments: a one-to-one mapping (short segments), many-to-one mapping (extra short segments) and a one-to-many mapping (long segments).4 A second option one could consider is to introduce a timing tier into the gestural model and thus to bring it even closer to the autosegmental framework. If this 385
DONCA STERIADE
option is taken, we must ask to what extent the remaining distinguishing characteristic of the gestural framework, the internal duration of gestures, is phonologically justified. I do not have a clear opinion on this issue, because the evidence bearing on it appears to point in opposite directions. On the one hand, one can argue against giving internal duration to gestures/autosegments on the grounds of restrictiveness. On the other hand, there is at least one class of phenomena, with arguably phonological status, which can be understood only in terms of the representations posited by Browman and Goldstein. I will now develop each one of these points in turn. 21.2.4
When simultaneous equals overlapping: an argument against units with internal duration It is obvious that if two systems of representation differ only in that one operates with units having internal duration while the other operates with punctual units, the latter will be the more restrictive one. This point can be illustrated by the following considerations. I have noted above that one can represent autosegmentally the articulatory overlap between nasality and labial closure in both long sequences (CC, as in (1)) and in short sequences (C as in (2)). However, the descriptive potential of autosegmental representations changes depending on whether we discuss long sequences or short ones. Suppose for instance that the nasality tier contrasts not two types of autosegments, [ -I- nasal] and [ — nasal], but only one autosegment, [ + nasal], and its absence. For a long sequence, this change of assumptions will not affect the representation of overlap: (3)
        Labial
        /    \
       C      C
       |
    [+nas]
But for a short sequence, the change from a double-valued nasality tier to a single-valued one means that overlap can no longer be distinguished from simultaneity: (4)
       Labial
         |
         C
         |
      [+nas]5
In contrast, the representations proposed by Browman and Goldstein, where gestures have internal duration, make it in principle possible to distinguish partial overlap from simultaneity regardless of the length of the sequence and regardless
of whether each of the tiers involved is assumed to contain one or more than one gesture. (5)
    Overlap on short sequence              Overlap on long sequence
    Tier 1   [--------]                    Tier 1   [----------------]
    Tier 2        [--------]               Tier 2         [----------------]
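The restrictiveness point can be restated computationally. The sketch below is mine and its interval values are invented: it simply shows that units with internal duration, modeled here as (start, end) pairs, support a three-way distinction (simultaneous, overlapping, disjoint) that punctual autosegments linked to one timing slot cannot express.

```python
# Sketch (intervals and labels assumed for illustration): a gesture with
# internal duration is an interval; partial overlap and exact simultaneity
# then come out as different relations.  A punctual autosegment linked to a
# single C slot has no internal structure, so the two collapse.
def relation(g1, g2):
    (s1, e1), (s2, e2) = g1, g2
    if (s1, e1) == (s2, e2):
        return "simultaneous"
    if s1 < e2 and s2 < e1:
        return "overlapping"
    return "disjoint"

labial_closure = (0.0, 1.0)
velum_lowering = (0.4, 1.4)   # a nasality gesture starting later

print(relation(labial_closure, velum_lowering))   # overlapping
print(relation(labial_closure, (0.0, 1.0)))       # simultaneous
```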
It is obvious that for the description of subphonemic aspects of articulatory timing the specification of internal duration in a gestural model is a valuable one. But the fact that autosegmental representations do not allow under certain conditions a distinction between overlap and simultaneity is also valuable, at a more abstract level of representation. An example which illustrates this is the distribution of tonal contours in languages such as Ancient Greek (Vendryes 1945; Steriade 1988) or Lithuanian (Halle and Vergnaud 1987 and references there). In these languages, short vowels are tonally specified as either High (H) or Low (L), while long vowels and diphthongs can be either H, L, falling (HL) or rising (LH). It is plausible to assume that the tonal tiers of both Greek and Lithuanian have only one underlying autosegment, H, and that what we call a L tone is, at the appropriate stage in the representation, simply the absence of H. If we make this assumption, we not only simplify the shape of underlying and intermediate representations but we also explain the asymmetry between tonal contours on long and short nuclei. The range of possible distinctions available on long nuclei is depicted below: (6)
       H              H              H
       |               \            |  \
      V V      ;      V V     ;    V V      ;      V V
   (falling)        (rising)     (level H)       (level L)
On short nuclei, the lack of distinction between overlap and simultaneity contracts these distinctions to just two: (7)
       H
       |
       V        ;        V
      (H)               (L)
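The numerical prediction can be checked with a short enumeration. The formulation is mine: it assumes that, with H as the only tone, a surface pattern on an n-mora nucleus is simply the set of moras that H is linked to (the empty set being read as L).

```python
# Enumeration of the prediction: a bimoraic (long) nucleus admits four tonal
# patterns, a monomoraic (short) nucleus only two, when H is the sole tone.
from itertools import chain, combinations

def tonal_patterns(n_moras):
    moras = range(n_moras)
    return list(chain.from_iterable(combinations(moras, k) for k in range(n_moras + 1)))

print(tonal_patterns(2))   # [(), (0,), (1,), (0, 1)]  -> L, falling, rising, level H
print(tonal_patterns(1))   # [(), (0,)]                -> L, H
```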
It is important to note that such an analysis predicts that every language in which H is the only underlying tone will exhibit the same asymmetry as Greek: long vowels may exhibit a three or four-way tonal contrast, whereas short vowels will exhibit only a binary contrast. Note that if we adopt a theory like Browman and Goldstein's, in which autosegments/gestures are directly timed relative to 387
each other, without the intermediary of a timing tier, the neutralization of the LH, HL, H contrast on short vowels must be stipulated. It can perhaps be described: one can require that the gesture equivalent to a H tone have a certain minimal duration, not significantly shorter than that of a short vowel gesture. But this is necessarily a language-specific stipulation, since languages exist in which no such restriction obtains: the best known example of this type is Mende (Leben 1978), where both long and short vowels exhibit contours. Not accidentally, the tonal phonology of Mende, as analyzed by Leben, displays the necessity of an underlying distinction between H tones, L tones and toneless vowels and thus provides an independent difference between Mende and Greek or Lithuanian. The correlation hypothesized here suggests that languages like Mende, which allow tonal contours on short nuclei, will always be languages in which at least two tones will be present underlyingly; whereas languages like Greek and Lithuanian, where contours cannot occur on short nuclei, will always be languages with only one underlying tone. If this correlation holds up, it should count as an important advantage of the autosegmental theory of timing, since that theory, together with the assumptions of underspecification, explains it by restricting the circumstances under which overlap and simultaneity can be distinguished. This advantage of the autosegmental theory follows directly from the assumption that autosegments do not have internal duration. 21.2.5
Dorsey's Law: an argument for internal duration
There is, however, a class of phenomena which point in the opposite direction and suggest that there exist phonological uses of the assumption that autosegments have internal duration. I will refer globally to these processes as Dorsey's Law, using the name of the Winnebago rule which exemplifies the entire class. To anticipate, Dorsey's Law looks like a vowel insertion rule which turns CCV(C) syllables into CvCV(C) sequences, where v stands for a copy of V. The problem raised by Dorsey's Law for a standard autosegmental analysis is that it is difficult to ensure that the quality of the inserted vowel will match that of the vowel tautosyllabic with the cluster in underlying representation. A standard vowel insertion process consists of two steps: inserting the V slot (or equivalent timing unit) and associating with it the appropriate segment. In the case of Dorsey's Law, the insertion of a V slot in a string like /i.tra/ creates a new syllable: /i.tV.ra/. At this point, it is no longer possible to determine that the syllable /tV/ was at some point a part of the last syllable /tra/: it is therefore not possible to tell whether the V in /tV/ should associate to the features of the preceding /i/ or of the following /a/. We shall see that this problem does not arise in Browman and Goldstein's gestural framework. In Winnebago a vowel is inserted between all underlying consonant clusters of the form Obstruent-Sonorant. This process is referred to as Dorsey's Law by Siouanists and has been analyzed by Miner (1979, 1981), Hale and White Eagle
(1980) and, most recently, by Halle and Vergnaud (1987). The inserted vowel is always a copy of the vowel which appears to the right of the cluster. The following examples come from Hale and White Eagle's study: (8)
a. sh-wa-zhok → shawazhok 'you mash potatoes'
b. ho-sh-wa-zha → hoshawazha 'you are sick'
c. hi-kro-ho → hikoroho 'he prepares'
d. hi-ra-t'at'a-sh-nak-shana → hirat'at'ashanakshana 'you are talking'
f. wakri-pras → wakiriparas 'flat bug'
g. wakri-pro-pro → wakiriporoporo 'spherical bug'6
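The surface effect of the rule can be sketched as a small program. The segmentation below is a toy one of my own (a crude obstruent/sonorant split over dot-separated syllabifications), not Hale and White Eagle's or Steriade's formalism; it only shows the copy-vowel pattern.

```python
# Minimal sketch of Dorsey's Law as copy epenthesis: in a tautosyllabic
# obstruent-sonorant onset, the nuclear vowel is copied between the two
# consonants, turning CCV into CVCV.  Segment classes are toy assumptions.
OBSTRUENTS = set("ptkbdgszfx")
SONORANTS  = set("rlmnwj")
VOWELS     = set("aeiou")

def dorseys_law(syllabified):
    out = []
    for syl in syllabified.split("."):
        if (len(syl) >= 3 and syl[0] in OBSTRUENTS
                and syl[1] in SONORANTS and syl[2] in VOWELS):
            out.append(syl[0] + syl[2])   # new syllable: C1 plus a copy of the vowel
            out.append(syl[1:])           # remainder: C2 V (C)
        else:
            out.append(syl)
    return ".".join(out)

print(dorseys_law("hi.ru.kna.na"))   # hi.ru.ka.na.na
print(dorseys_law("pa.tri"))         # pa.ti.ri  (cf. Late Latin patiri below)
print(dorseys_law("mi.tra"))         # mi.ta.ra
```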
Two points must be clarified about the operation of Dorsey's Law. Both Hale and White Eagle and Halle and Vergnaud point out that the rule applies after stress is assigned but that the inserted syllable determines, under certain circumstances, a metrical restructuring of the word. This fact indicates (a) that Dorsey's Law applies after syllabification (since it applies after stress and stress is necessarily dependent on syllabification); and (b) that Dorsey's Law inserts, directly or indirectly, a new syllable into the string. The second point can be directly established by observing that some of the syllables resulting from Dorsey's Law appear as stressed in the forms above. The fact that Dorsey's Law applies only between obstruents and sonorants in prevocalic position, a type of sequence frequently analyzed as a CCV syllable, suggests that the clusters it singles out are complex onsets. The hypothesis is then that Dorsey's Law turns an underlying CCV syllable into a CVCV sequence, by copying the nuclear vowel between the members of the complex onset. This suggestion is independently supported within Winnebago by the following detail, mentioned by Miner (1981) and further analyzed by Saddy (1984): Dorsey's Law does not apply to heteromorphemic VC][CV sequences. Thus the /kna/ sequence in /waak-nak-ga/ 'man-SITTING POSITIONAL-DEMONSTRATIVE' (surface /waagnaka/) does not become /kana/.7 In contrast, the /kn/ of monomorphemic /hiruknana/ 'boss' undergoes Dorsey's Law: the surface form is /hirukanana/. The failure of Dorsey's Law in VC][CV contexts is accounted for by assuming that syllabification proceeds cyclically: on the first cycle, the syllable /waak/ is formed, with /k/ in coda position. Thereafter, this /k/ is unavailable for resyllabification into a complex onset, for general reasons discussed by Prince (1985) and Steriade (1988). Because the /kn/ cluster remains heterosyllabic in this case, Dorsey's Law fails to apply and we obtain /waag.na.ka/. In contrast, the tautomorphemic sequence /hiruknana/ is syllabified /hi.ru.kna.na/ on the first cycle, after which the onset cluster /kn/ undergoes Dorsey's Law. Thus cyclicity of syllabification and the assumption that Dorsey's Law affects only tautosyllabic clusters explain the two different treatments of /kna/ in /waagnaka/ and /hirukanana/.8 The phenomenon I call Dorsey's Law is widespread and may occur in
languages whose underlying syllabification can be determined more directly. In Late Latin, this process used to operate sporadically, as we can tell from the spelling of inscriptions and from its inherited effects on the Romance languages. The Late Latin syllabification is reconstructible from a variety of phonological indications discussed in Steriade (1987): the complex onsets consist of obstruent-liquid clusters, or perhaps more generally of obstruent-sonorant clusters. The epigraphic evidence collected by Schuchardt (1867) demonstrates that these clusters are frequently broken up by a copy of the vowel to their right.9 (9)
A.lek.san.dri: Alexandiri
Mi.tra: Mitara
pa.tri: patiri
chla.my.dem: chalamydem
cri.brum: ciribrum
scrip.tum: sciriptum
The.o.pras.tus: Theoparastus
Clau.di.a.nus: Calaudianus
pro.cras.ti.na.ta: procarastinata
Among the Romance languages, Sardinian appears to have preserved and extended Dorsey's Law. The following examples are among those cited by Wagner (1907): (10)
um.bra: umbara
ci.li.bru: ce.li.vu.ru
co.lo.bru: co.lo.vu.ru
xu.cla: cu.ca.ra
Two properties of Dorsey's Law, in both the Winnebago and the Romance version, should be stressed. First, the clusters separated by epenthesis are underlying onsets: this may not be completely clear in Winnebago but is beyond doubt in Romance. Second, the quality of the inserted vowel matches that of the underlying nucleus tautosyllabic with the complex onset: compare for instance /Mitara/ (from /Mi.tra/) with /patiri/ (from /pa.tri/). Two aspects of the gestural model can be combined to explain these facts. First, gestures have duration. Second, Browman and Goldstein suggest that within a syllable consonantal articulations are superimposed on the vowel gesture: simplifying somewhat, we can represent gesturally a syllable like /pra/ as below: (11)
Tier             Gestures
tongue body      [------------ a ------------]
tongue tip           [---- r ----]
lips             [---- p ----]
Dorsey's Law can then be viewed as just a change in the relative timing of the three gestures. From an input syllable beginning with two consonantal gestures overlapping in duration with each other, a delay in the onset of the liquid can create a sequence in which the two gestures have ceased to overlap. A significant delay will create a sequence in which the vowel gesture begins to "show" between the consonantal gestures: (12)
Tiers            Gestures
tongue body      [------------ a ------------]
tongue tip                     [---- r ----]
lips             [---- p ----]
To complete the account, we must add the following assumption about the syllabic interpretation of overlapping vocalic and consonantal gestures: a vowel gesture is interpreted as a monosyllable only if all the superimposed consonantal gestures are peripheral, that is only if the beginning of a contiguous cluster of consonantal articulations coincides with or precedes the beginning of the vocalic gesture (or, in the case of a postvocalic cluster, only if the end of the cluster coincides with or follows the end of the vowel). Since Dorsey's Law creates a sequence in which a consonant gesture has come to be nonperipherally superimposed on a vowel gesture, it automatically turns a monosyllable into a disyllable. What is intuitively satisfying about this analysis is the fact that Dorsey's Law can now be viewed as the effect of a single timing adjustment. As Browman and Goldstein point out, most phonological processes originate as changes in the timing between articulations: the possibility of viewing Dorsey's Law in the same terms is a significant attraction of the gestural model.
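The peripherality condition can be paraphrased as a short sketch. The intervals below are invented for illustration, and the syllable-counting rule is my reading of the assumption just stated, covering both edges of the vowel.

```python
# Toy reading of the syllabic-interpretation assumption: a consonant cluster
# stranded in the middle of the vowel gesture (nonperipheral) splits off an
# extra syllable; clusters at either edge of the vowel do not.
def syllable_count(vowel, consonants):
    v_start, v_end = vowel
    clusters = []
    for s, e in sorted(consonants):           # merge overlapping consonant gestures
        if clusters and s <= clusters[-1][1]:
            clusters[-1][1] = max(clusters[-1][1], e)
        else:
            clusters.append([s, e])
    internal = [c for c in clusters if c[0] > v_start and c[1] < v_end]
    return 1 + len(internal)

a = (0.0, 10.0)
p = (0.0, 3.0)

print(syllable_count(a, [p, (2.0, 5.0)]))    # 1 -> /pra/: r still overlaps p, cluster is peripheral
print(syllable_count(a, [p, (5.0, 8.0)]))    # 2 -> delayed r: the vowel shows through, /para/
print(syllable_count(a, [p, (7.0, 10.0)]))   # 1 -> r displaced to the far edge, still one syllable: /par/
```

The third case anticipates the variant discussed in the next section, where the displaced consonant reaches a peripheral position and the result remains monosyllabic.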
21.2.6 Variants of Dorsey's Law
It is in principle possible that the displacement of the second consonant in a syllable initial cluster would carry it all the way to the opposite end of the syllable. In this case, the displaced consonant will be peripheral and the result of displacement will still be interpretable as a monosyllable. (13)
Gestures: input (pra)
  tongue body    [------------ a ------------]
  tongue tip         [---- r ----]
  lips             [---- p ----]

Gestures: output (par)
  tongue body    [------------ a ------------]
  tongue tip                        [---- r ----]
  lips             [---- p ----]
This type of change is also encountered in Latin, although during an earlier stage than Dorsey's Law. Early and prehistoric Latin displays occasional metathesis between the second member of a complex onset and the following vowel. This phenomenon was documented in some detail by Juret (1921)10: (14)
trapezita      tarpezita     "table"
*plumo:        pulmo:        "lung"
*dlukʷis       dulcis        "sweet"
The gestural analysis of Dorsey's Law given above extends naturally to this case. We can suggest that the intra-syllabic movement of the consonantal gestures has two sets of parameters: the direction (leftwards/rightwards) and whether the target position is peripheral or not. Dorsey's Law is rightwards movement to a nonperipheral position, whereas liquid/vowel metathesis is rightwards movement to a peripheral position. The suggestion then is that Dorsey's Law and intrasyllabic metathesis are in fact one and the same phenomenon: a significant delay in the onset of the second consonant gesture of a complex onset. Leftward intra-syllabic movement is also encountered, in both the peripheral and the nonperipheral versions.11 Proto-Slavic CVR(C) syllables (where R is a liquid, /l/ or /r/) shifted their liquids leftwards: South, Central and certain Western Slavic dialects turned CVR(C) to CRV(C), while Eastern Slavic turned CVR(C) to CVRV(C). The Eastern Slavic development is the inverse of Dorsey's Law; the other dialects display the inverse of Latin intrasyllabic metathesis. The history of Slavic CVR syllables is told in Meillet (1934), Vaillant (1950) and Shevelov (1963). The following examples come from Shevelov (1963: 391-421).
(15)
Pre-Slavic                      Eastern Slavic          Elsewhere in Slavic
*karv-   "cow"12                Russian /korova/        Slovak /krava/
*bergh-  "birch"                Russian /bereza/        Slovak /breza/, Polabian /breza/
*ghord-  "yard", "town"         Russian /gorod/         Old Church Slavic /gradu/
*melk-   "milk"                 Russian /moloko/        Old Church Slavic /mleko/
*wolgh-  "moisture", "force"    Ukrainian /voloha/      Slovak /vlaha/, Serbo-Croatian /vlaga/

The data examined so far illustrate all four settings of the parameters of intrasyllabic movement: rightwards movement occurs in Winnebago and Romance, leftwards movement in Slavic. Movement to nonperipheral position appears in Late Latin (/Mitra/ -> /Mitara/) and Eastern Slavic (/*gord-/ -> /gorod/); movement to peripheral position appears in early Latin (/trapezita/ -> /tarpezita/) and in most other Slavic dialects (/*gord-/ -> /gradu/). Note that the target position of the moved consonantal gesture must always be a point in the duration of the vowel it is superimposed on (i.e. the tautosyllabic
vowel): this fact explains why sequences like /Mi.tra/ yield /Mi.ta.ra/ rather than */Mi.ti.ra/. There is a categorical aspect to intrasyllabic movement, in that different dialects select consistently either movement to peripheral or to nonperipheral position and do not appear to mix the two. What is more interesting is that the actual size of the displacement to a nonperipheral position may remain undetermined. We can guess at this from the fact that the inscriptions which attest the Latin version of Dorsey's Law provide a second type of spellings, seen below: (16)
li.bras: liberas
sa.crum: sacerum
su.pra: supera
pa.tri: pateri (cf. patiri in (9))
gra.ci.lis: geracilis
glo.ria: geloria
tri.bu.na.tu: teribunatu
I take the intrusive /e/ of /sacerum/ to stand for a vowel of indeterminate quality, a vowel that does not fall squarely within any phonemic category for which there is a letter in the Latin alphabet. This said, the difference between Dorsey's Law in /patiri/ and /pateri/ (both from /pa.tri/) could plausibly be attributed to the difference between a large enough displacement to leave behind an identifiable vowel quality (in /patiri/) and a displacement that is too small for that purpose (/pateri/). My suggestion then is that there is free variation in the actual size of the timing adjustment that yields the effects we call Dorsey's Law: the only constant aspect of the adjustment is whether the target position of the movement is peripheral or not. In this respect, intra-syllabic movement resembles a phonological rule. One aspect of intra-syllabic movement not discussed here is its relation to conditions of well-formedness on the resulting syllable. Dorsey's Law, as well as rightwards intrasyllabic metathesis (/tra/ -> /tar/), applies only to syllables that have complex onsets: changes like /ra/ -> /ara/ or /ra/ -> /ar/ are not attested. This could be attributed to the fact that such applications of intrasyllabic movement create onsetless syllables. The account sketched here rests on the possibility of distinguishing several points in the duration of the vowel. To my knowledge this is the only arguably phonological phenomenon that requires this assumption and, as such, deserves more careful investigation. In closing, I would like to explain why the paradigm presented here favors a gestural analysis over autosegmental alternatives.13 An essential part of the analysis is Browman and Goldstein's idea that vowel and consonant gestures are superimposed within a given syllable. This idea can be implemented autosegmentally, by associating the features of the nuclear vowel to the matrix of tautosyllabic consonants. To do so, I assume the general outlines of Clements's (1985) and Sagey's (1986) proposals about feature organization. The
change from /pra/ to /para/ can then be viewed as the insertion of a vowel position between the members of the complex onset. I represent below two steps in this process: the insertion of the vowel and its association to the set of tongue-body specifications of the surrounding segments. The vowel specifications are represented on the Dorsal tier; those of /p/ and /r/ on the Labial and Coronal tiers respectively. (17)
[diagram: Dorsal tier, Labial tier, Coronal tier, Place tier, Root tier, skeleton]
Given the input structure assumed, in which all tautosyllabic segments are associated to the same set of dorsal (= tongue body) specifications, the inserted vowel cannot link up to any other dorsal values. What remains unexplained, however, is the relation between metathesis and vowel insertion: the relation between the treatment of /*bergh-/ in South Slavic /breza/ and in East Slavic /bereza/. Each option can be described autosegmentally: insertion as shown above and metathesis as insertion plus deletion of the original V slot. But the autosegmental analysis fails in that it provides no connection between the two operations involved in metathesis. We could equally well have coupled vowel insertion with any other change in the relevant syllable: a second vowel insertion, a consonant insertion or deletion, an unrelated assimilation rule. In contrast, the gestural analysis of intra-syllabic movement has a built-in account of such dialectal variation, since the target of movement can be either a peripheral or a nonperipheral position within the relevant domain. No other options can be conceived of.
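For reference, the two-way parameterization of intra-syllabic movement described above (direction of displacement, and peripheral vs. nonperipheral target) can be tabulated in a small sketch; the mapping from parameter settings to languages simply restates the correspondences given in the text, and the function name is mine.

```python
# Illustrative summary of the two parameters of intra-syllabic movement and
# the patterns the text associates with each setting.
PATTERNS = {
    ("rightward", "nonperipheral"): ("CCV -> CvCV", "Winnebago; Late Latin /Mi.tra/ -> /Mitara/"),
    ("rightward", "peripheral"):    ("CCV -> CVC",  "early Latin /trapezita/ -> /tarpezita/"),
    ("leftward",  "nonperipheral"): ("CVR -> CVRV", "Eastern Slavic /*gord-/ -> /gorod/"),
    ("leftward",  "peripheral"):    ("CVR -> CRV",  "other Slavic dialects /*gord-/ -> /gradu/"),
}

def describe(direction, target):
    shape, example = PATTERNS[(direction, target)]
    return f"{direction} movement to a {target} position: {shape} ({example})"

for setting in PATTERNS:
    print(describe(*setting))
```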
21.3 Conclusion
I hope to have shown here that one central aspect of the gestural theory of articulatory timing deserves the close attention of phonologists: the idea that the units of phonological representations are not points in time but rather elements endowed with internal duration.
Notes
1 See, however, Sagey's (1986) remarks on the crossing-lines constraint, which suggest that at least one axiom of the autosegmental model can be derived from general extralinguistic principles if one assumes that autosegments do have internal duration.
2 A recent and detailed discussion of this class of phonological structures appears in Sagey (1986).
3 The data provided by Browman and Goldstein (1986) support the representation in (2) for prenasalized consonants, showing that the labial gesture of Chaga prenasalized /mb/ is as short as that of a plain /m/ or /p/. More puzzling is the fact that English /mp/, /mb/, which are normally interpreted as phonologically long sequences (as in (1)), turn out to have labial gestures of the same length as simple /p/, /b/, /m/ when pronounced in the same V'_V context. Given that English lacks true geminate stops, it is impossible to tell whether we are dealing here with a subphonemic shortening of the /mb/, /mp/ sequences or with a stress-conditioned lengthening of /m/, /p/, /b/. One should stress that, no matter how the English facts turn out, there is no systematic phonological equivalence between the quantity of homorganic nasal-stop clusters and that of corresponding simple stops.
4 One should bear in mind that phonological rules and representations do not count: they can distinguish between one and many units but not between one and exactly three (on this see also McCarthy and Prince 1986). This is why no phonological distinction exists between one autosegment linked to two slots and one linked to three: both are long, in that both display the one-to-many mapping mode.
5 This point is relevant not only for a comparison between the gestural and the autosegmental models but also for the comparison between autosegmental models operating with binary features and those operating exclusively with single-valued features. Note for instance that no framework which combines adherence to the autosegmental theory of articulatory timing with the assumption that all features are single-valued can describe prenasalized stops as distinct from nasal stops.
6 The digraphs <sh> and <zh> denote palatal fricatives, and not clusters.
7 The surface form, /waagnaka/, is derived by a later rule which voices obstruents before consonantal sonorants. In contexts where Dorsey's Law has applied, the voicing rule is bled: hence /hirukanana/ from /hiruknana/, rather than */hiruganana/.
8 One should also point out that the syllabification of heteromorphemic VC-CV sequences differs from that of C-CV sequences, such as /sh-wa/ in /sh-wa-zhok/ -> shawazhok. The latter type of cluster is syllabified as an onset, and is hence a possible input to Dorsey's Law, while the former is not. This difference in syllabification can be explained simply. A (C)VC morpheme can be independently syllabified on a first cycle as a (C)VC syllable, with a coda. In contrast, a bare C morpheme cannot be independently syllabified, since Winnebago syllables must contain vowels: the C remains therefore syllabically stray until it can be adjoined to an existing syllable. This is what happens in the case of C-CV morpheme sequences, where the bare C is incorporated as onset into the existing CV syllable of the adjacent morpheme.
9 Schuchardt's collection of vowel insertions is not limited to instances of Dorsey's Law.
One also finds cases like /maginam/ from /magnam/, or /exaceto/ from /exacto/: insertions of this type are, however, very infrequent and without counterparts in the data inherited by the Romance languages. It is impossible to tell whether they represent sporadic changes in the pronunciation of individual items or spelling mistakes. Other, more regular cases of vowel insertion encountered in Schuchardt's data are discussed below.
10 The asterisks in the following forms denote items reconstructed on the basis of comparative evidence.
11 Latin leftwards movement to a nonperipheral position is occasionally attested in late inscriptions: for instance /inide/ from /in.de/, /inker/ from /in.ter/, /Militiades/ from /Mil.ti.a.des/. However sporadic, leftward nonperipheral movement was apparently continued in Sardinian, as examples like /saragus/ from /sar.gus/ (cited by Wagner 1907) indicate.
12 The Slavic root-final consonant clusters like /rv/ in /karv/ 'cow' were heterosyllabic when a vowel-initial suffix followed: /kar.vV.../. In such cases, we have a superficial asymmetry between Dorsey's Law (rightwards intra-syllabic movement) and its inverse (leftwards intrasyllabic movement): Dorsey's Law takes place only when the moving consonant is preceded by at least one tautosyllabic segment whereas the inverse of Dorsey's Law is not conditioned by the presence of any following segment. The reason for this difference has to do with the overwhelming preference for syllables with onsets: had Dorsey's Law applied in CV syllables, it would create V.CV sequences containing an onsetless initial. This is apparently being avoided. A related observation can be made about intrasyllabic leftwards movement in Slavic: all dialects of Slavic turn onsetless syllables such as */alk-/ 'hungry' into /lak-/ (Russian /lakat'/ 'lap', Slovak /lakat'/ 'be hungry'), even though Eastern Slavic is expected to yield /alak-/ (on the pattern of */gord-/ -> /gorod/). This too can be attributed to the tendency to avoid onsetless syllables.
13 Thanks to John McCarthy for suggesting the autosegmental analysis discussed below.
References
Browman, Catherine and Louis Goldstein. 1986. Towards an articulatory phonology. Phonology Yearbook 3: 219-252.
Clements, George N. 1985. The geometry of phonological features. Phonology Yearbook 2: 225-252.
Clements, George N. 1987. Phonological feature representation and the description of intrusive stops. To appear in the CLS 23, Parasession on Metrical and Autosegmental Phonology.
Clements, George N. and Samuel J. Keyser. 1983. CV Phonology. Linguistic Inquiry Monograph, MIT Press.
Goldsmith, John. 1979. Autosegmental Phonology. New York: Garland Press.
Hale, Kenneth and Josie White Eagle. 1980. A preliminary metrical account of Winnebago accent. International Journal of American Linguistics 46.
Halle, Morris and Jean-Roger Vergnaud. 1987. An Essay on Stress. Cambridge, MA: MIT Press.
Hayes, Bruce. 1986. Inalterability in CV phonology. Language 62: 321-351.
Juret, Alphonse. 1921. Manuel de phonetique latine. Paris.
Kenstowicz, Michael. 1982. Gemination and spirantization in Tigrinya. Studies in the Linguistic Sciences 12(1): 103-122.
Keyser, Samuel J. and Paul Kiparsky. 1984. Syllable structure in Finnish phonology. In M. Aronoff and R. Oehrle (eds.) Language Sound Structure. Cambridge, MA: MIT Press.
Leben, William. 1978. The representation of tone. In V. Fromkin (ed.) Tone: A Linguistic Survey. Academic Press.
Leben, William. 1980. A metrical analysis of length. Linguistic Inquiry 11: 497-509.
McCarthy, John J. 1979. Formal problems in Semitic phonology and morphology. Ph.D. dissertation, MIT.
McCarthy, John J. and Alan Prince. 1986. Prosodic morphology. MS, University of Massachusetts, Amherst and Brandeis University. [To appear: MIT Press.]
Meillet, Antoine. 1934. Le Slave Commun. Paris.
Miner, Kenneth. 1979. Dorsey's Law in Winnebago-Chiwere and Winnebago accent. International Journal of American Linguistics 45: 25-33.
Miner, Kenneth. 1981. Metrics, or Winnebago made harder. International Journal of American Linguistics 47: 340-342.
Mohanan, K. P. 1983. The structure of the melody. MS, MIT and Stanford University.
Prince, Alan. 1984. Phonology with tiers. In M. Aronoff and R. Oehrle (eds.) Language Sound Structure. Cambridge, MA: MIT Press.
Prince, Alan. 1985. Improving tree theory. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, vol. 11.
Saddy, Douglas. 1984. Dorsey's Law in Winnebago. MS, MIT.
Sagey, Elizabeth. 1986. The representation of features and relations in non-linear phonology. Ph.D. dissertation, MIT.
Schein, Barry. 1981. Spirantization in Tigrinya. In H. Borer and J. Aoun (eds.) Theoretical Issues in the Grammars of Semitic Languages, MIT Working Papers in Linguistics vol. 3.
Schein, Barry and Donca Steriade. 1986. On Geminates. Linguistic Inquiry 17: 691-744.
Schuchardt, Hugo. 1867. Der Vokalismus des Vulgarlateins. Vol. 2. Leipzig: Teubner.
Shevelov, George. 1965. A Prehistory of Slavic. Columbia University Press.
Steriade, Donca. 1987. Syllabification and stress shift in proto-romance. In J.-P. Montreuil (ed.) Proceedings of the 16th Linguistic Symposium on Romance Linguistics. Dordrecht: Reidel, 371-410.
Steriade, Donca. (To appear). Greek accent: a case for preserving structure. Linguistic Inquiry 19.
Vaillant, Andre. 1950. Grammaire comparee des langues slaves, I. Phonetique.
Vendryes, Joseph. 1945. Traite d'accentuation grecque. Paris: Klincksieck.
Wagner, Heinrich. 1907. Lautlehre der Sardinischen Mundarten. Halle.
22 On dividing phonetics and phonology: comments on the papers by Clements and by Browman and Goldstein
PETER LADEFOGED
Osamu Fujimura tried to bridge the gap between these two great papers. I am going to try to deepen the gulf. Whatever their authors' intentions, the gap between their theoretical positions can never be bridged. The reasons for this will become clear after we have examined a little more precisely what each paper says. The Browman and Goldstein model has been described somewhat briefly before (Browman and Goldstein 1985, 1986); I am delighted to see that it is now becoming more articulated. Browman and Goldstein are giving us a valuable new theoretical framework, and presenting it in a way that is explicit and testable. They suggest that the phonological representation of a word should be partly in terms of a gestural score for articulatory features such as the manner and degree of constriction due to the tongue tip and that due to the tongue body. I want to consider what these tract variables (features?) can do from a linguistic point of view. First of all, are they sufficient? Consider, for example, the status of the jaw. In the current model, the jaw is not an articulatory feature that is part of what they call "the canonical phonological form." So that we can understand the linguistic implications of the Browman and Goldstein view, we will consider whether it would be better if the movements of the jaw were in fact also given by a separate gestural tier in their phonological representation. There are many individual differences in the way that people coordinate jaw and tongue body movements. Some speakers control the position of the tongue essentially by jaw movements when making the vowels in "heed, hid, hayed, head, had." For these speakers the front vowels could be described as having a neutral tongue position and a phonologically determined jaw position. Other speakers have one jaw position for the high vowels in "heed, hid," another for the mid vowels in "hayed, head," and a third for the low vowel in "had," the difference between the so-called tense and lax vowels in each of the pairs being made by advancing the tongue root and thus raising the body of the tongue within the jaw. For these speakers the phonological difference between "hid-head" and between "head-hayed" would involve a feature Tense, realized in some way as 398
advancement of the tongue root, and the phonological difference between "hid, head, had" would be described in terms of the jaw height feature. Yet other speakers have more idiosyncratic modes of coordinating the tongue and jaw movements that involve more complex feature specifications. I have not yet examined enough X-ray data to be sure, but it seems likely that the majority of American English speakers move the tongue body somewhat independently of the jaw when making vowels. As my colleague Pat Keating might put it, the jaw has a wide window in making vowels. If jaw location is a real phonological feature, then it appears that, at least for some speakers, it is underspecified for front vowels. This is a somewhat unsatisfactory view, in that it is clear that there is also a group of speakers for whom jaw height interacts with tenseness to specify different phonological classes of vowels in a reliable way. So it would seem that this view of phonology is forcing us into the position that these two groups of speakers have different phonologies; which could be true, but is by no means self-evident on other grounds. Alternatively it might be that the jaw position is not a phonologically specified feature at all, despite the fact that it divides vowels into phonological classes for some speakers. It is the movement of the tongue relative to the upper surface of the vocal tract that is important; the movement of the jaw is, within very wide limits, irrelevant for these sounds. But it is in fact a much needed phonological feature in the case of some other sounds. Virtually everybody has a highly specific jaw movement in sibilants, in which the upper and lower incisors must be close together (Shadle 1985). In these sounds the jaw, or more particularly the lower incisors, should be regarded as an active articulator that must be specified. It is, from an articulatory phonology point of view, the only feature that determines the natural class of sibilant sounds. If we disregard the individual differences among speakers in the production of vowels, we can make an interesting phonological point about a possible difference between vowels and consonants, within a slightly modified version of the Browman and Goldstein model which allows gestures of the jaw to be specified on a separate tier. The phonological specification of the jaw partially determines the tongue body in some sounds - notably sibilants; whereas the reverse is true in that the specification of the tongue body largely determines that of the jaw in other cases - notably vowels. This is the kind of complexity that we do not yet see in the Browman and Goldstein model; but one of the many things that is exciting about this conceptual framework is that it could be built in. There is nothing in their theory that prevents us from arranging different coordinations in vowels and consonants in this way; and it would no doubt be illuminating to do so. There are other interactions between articulatory components of gestures that it would also be advisable to build in, if only to avoid some misconceptions that might otherwise arise. The Browman and Goldstein model tempts one into thinking, as they certainly state, that gestures can be described in terms of separate 399
tiers. But it has long been known that, for example, movements of the tip of the tongue affect the body of the tongue as a whole. Stevens et al. (1986) have shown that such interdependences can have interesting phonological consequences. Again, there is no conceptual reason why this interaction should not be included within the model. The present inadequacy in the method of specifying interactions of the tip and body of the tongue is simply a legacy of the articulatory model (Rubin et al. 1981) on which the Browman and Goldstein model is based. It would now seem preferable to consider movements of the tongue, whether of the tip or the body or the root, as being deviations that could affect the entire shape (Ladefoged 1980). Obviously, suggestions such as these for alterations of details of the model are not damaging to the approach that Browman and Goldstein are advocating. They are really indicative of the strength of their approach, in that it can so readily accommodate them. What is more worrying, however, is the question of to what extent it is appropriate to expect a phonology to be expressed in articulatory terms such as theirs. Browman and Goldstein are well aware that their model will not account for auditory aspects of phonology. As if to make up for this major shortcoming they point out that their model is part of an articulatory synthesis system, and that it can make sounds that can be used in perception tests. But they have offered us no formal way of using this information; and it is not at all clear to me that it would be possible to make a formal phonology out of these combined notions. I doubt it is profitable to argue here about what a phonology is. For me it is an explanatory account of the patterns of sound that occur in a language. If you want to hold a more mentalist view than that carefully worded neutral statement, then that is fine and will not affect the rest of my argument. Because the point remains, as Browman and Goldstein note but disregard, phonology has to account for processes that have arisen because speech has to be heard as well as spoken. I am sure they know, but other people may not realize that their proposals cannot account for the vowel raising and lowering in the English vowel shift, or the vowel fronting in the German umlaut, because they do not set up underlying forms and they do not have the features which reflect the required natural classes. They are not concerned with modeling these kinds of phonological processes; they view their task as the formalization of the relation between "the lexical characterization of a word and its characterization in connected speech." But they do not consider abstract representations of words that require transformation through phonological rules such as those involving vowel shift or umlaut. They are making tremendous and very valuable strides towards their goal of characterizing connected speech. But, to make a point to which I will return at the end of this comment, that is not the way to do phonology. There is a principled reason why Browman and Goldstein are bound to fail in any attempt to explain phonological processes such as vowel shift and umlaut, while using tract variables giving the tongue and lip constriction as they do. These 400
processes arose because of the way listeners pay attention to salient auditory characteristics of sounds. They cannot be explained by rules saying where speakers put their tongues. Vowel raising and lowering involve making systematic changes in the frequency of the first formant. Umlaut requires changes in the other formants while keeping the first formant the same. These changes are simple to state and to incorporate into a phonology, provided one uses features such as High and Low (or a multi-valued Height feature), and Back to refer to acoustic, not articulatory, properties. The Browman and Goldstein vocal tract parameters are not in one-to-one correspondence with the formant frequency changes that underlie these phonological processes. There is no way in which they can be stated in an explanatory way in terms of independent variations of articulatory parameters such as the degree and location of the tongue body constriction. We can make this point explicit by reference to a rule of the form:

[-high] → [+high] / X
i.e. mid vowels [e, o] become high vowels [i, u] respectively in some context here designated as X. This rule works fine if High refers to the frequency of the first formant; [e] and [o] are in the middle of the F1 range, and both differ from [i] and [u] in much the same way in acoustic terms. But High cannot be said to refer to either the degree or the location of the tongue body constriction. Let me reinforce this point by putting it another way. How can Browman and Goldstein use their features to divide sounds into appropriate natural classes? Their gestural scores define the location and degree of articulatory constrictions. Consider the natural classes of vowels that result. Figure 22.1 is a plot of the vowels [i, e, ɛ, æ, ɑ, ɔ, o, u] in terms of the tract variables proposed by Browman and Goldstein applied to the articulatory data reported in Ladefoged (1971:68). All the vowels [ɑ, ɔ, o, u, i] have a fairly small degree of constriction; for the vowels [e, ɛ] the constriction is located at a considerable distance from the glottis, but for the vowel [æ] what constriction there is is comparatively close to the glottis; and for [u] it is about in the middle of the range. There is no way in which any of the resulting groupings can be called natural phonological classes. It might be imagined that the Browman and Goldstein model could be modified so as to state the tract variables in terms of the high-low and back-front location of the body of the tongue. But as can be seen from the articulatory diagrams I have given elsewhere (Ladefoged 1982: figures 1.9 and 1.10), [e] is not in the same region of the tongue height range as [o]; it is closer in height to [u]. The distances between the two pairs are also not the same. The articulatory metric could be warped so as to make it coincide with the acoustic differences. This warping would have to involve High, Low and Back simultaneously, indicating that they are not independent features in articulatory terms, and making it impossible for even a
Figure 22.1 The degree and location (with respect to the glottis) of the tongue body constriction in a set of vowels, based on tracings from single frames of a cineradiology film as shown in Ladefoged (1971). (Horizontal axis: location in mm.)
model of this kind to use the features High and Low to provide an explanatory account of processes such as the English vowel shift rule. Their model, like any articulatory model, cannot, in principle, provide the correct natural classes of vowels because they are the product of auditory/acoustic similarities, not articulatory/physiological properties. Even within the more restricted domain of articulatory features, Browman and Goldstein have problems. They do not have a general notion of vocal tract narrowing. Their phonological specifications of [f, θ, x] would indicate that they have narrow constrictions of the lip, tongue tip, and tongue body, respectively, entailing no formal resemblance in the representations of these sounds, despite the fact that most phonologists would want to regard them as belonging to a natural class. (Browman and Goldstein are not trying to account for historical processes; but it is surely incumbent on anyone trying to set up a system of phonology to provide an adequate way of representing Grimm's Law.) It is also not clear whether their tract variables would allow them to group together sounds within the major classes, such as Sonorant; or whether they would allow them to specify classes such as Strident. The inability of the Browman and Goldstein tract variables to provide transparent definitions of the natural classes required in phonological rules is the articulatory equivalent of the acoustic invariance problem that has been attacked by Stevens and his colleagues (Stevens 1972; Stevens and Blumstein 1975). Some
features, such as Labial, have clear articulatory correlates. Others, such as High, have only a relative invariance; their physiological correlates are relative to those of other features, involving a many-to-many mapping between features and physiological properties. One way out of these problems is to assume that the higher levels of phonological rules are expressed in terms of the conventional features that we all know and love, such as Labial, Coronal, High, Low (although I would prefer a three-valued Height feature), Nasal, Lateral, etc. There would then be a complex many-to-many mapping from specifications in terms of these features to those required for articulatory gestures in the Browman and Goldstein model. But if Bromwan and Goldstein were to take this approach it would mean that they were not really doing articulatory phonology; their work would be more in the realm of articulatory phonetics. Now let me turn to Clements's paper. This paper starts by discussing the tempting but troublesome notion of sonority. It then goes on to formalize a very nice set of observations about syllable types that do not rely on this notion. But they nevertheless do depend on knowing what a syllable is. This is an age-old problem that is not resolved by vaguely presuming that there is some phonetic definition of the feature Syllabic. I long ago gave up trying to define the phonetic properties of a feature of this sort, and decided to refer to a syllable as "an organizing principle" for segments (Ladefoged, 1971, 1985). But now I would prefer to explicitly recognize that the syllable is a phonological notion, further noting that phonology is not necessarily natural (Anderson 1981), and there is no reason to expect that all of its constructs should have simple physical parameters. There is a great gap between phonetics and phonology; but this does not lessen the stature of either of these disciplines. From my point of view, all the major class features described by Chomsky and Halle (1968) are clearly phonological constructs, not definable in terms of articulatory parameters, despite their attempts to do so. I also think that Clements's theory relating what he calls stricture features and segment types merely states a relation between two sets of such constructs. I have no objection to distinguishing segment types such as glide and vowel in terms of features such as Syllabic; nor do I object to defining Syllabic by reference to the distinction between glide and vowel. But I would like to be clear that these very valuable constructs of phonological theory do not have a quantifiable phonetic basis. I take it that Clements agrees in that he states that the feature Syllabic is "properly defined in terms of the hierarchical aspects of syllable structure." I am not quite clear what this cryptic remark means; but it certainly seems that Syllabic is not like his other "stricture features." What is important to our discussion here is whether the notion syllable is common to all languages. Is it an innate part of our phonetic competence that we use in learning to speak any language ? Or are the syllables that we observe simply 403
the consequences of language-dependent constraints on segment concatenation ? This is a chicken-and-egg problem, in which we do not know which way round the dependencies go. Like most such problems, it may be better resolved by regarding both items, syllables and segments, as properties definable in terms of a larger unit, such as a word or even a phrase. As a converted addict of American football I have come to believe that when things get tough one should get rid of the ball and punt. So, as a phonetician, I am going to punt the problem to the morphologists, who can, I hope, tell me what a word is. I am not sure if this is true, but I would like to go on believing that languages have words, and that we ought to be stating our universals in terms of word and sentence phonology. I would anticipate that, from a phonological point of view, syllables are language dependent organizing principles, providing structure within a word; and segments are further phonological constructs within syllables. From the point of view of a phonetician concerned with describing the production of speech, words have a complex structure that is somewhat parallel to their phonological structure. Indeed, in many cases it is very similar, which is why phonology may sometimes seem natural, and why there are physical parallels for many phonological phenomena. But it does not follow that the correspondence between phonological and phonetic descriptions should be largely statable in terms of features, the smallest units in each domain. This seems to me to be an improbable assumption that needs to be much more carefully argued by anyone who believes in it. There is, and should be, a gulf between the physiological notions of phonetics and the mentalistic notions of phonology, which is well demonstrated by these two papers. Browman and Goldstein are good phoneticians providing an excellent description of how we may produce utterances. My quibbles are trivial in comparison with the theoretical advances they are making in this field. Clements's work is part of a phonology that is also making great strides. Browman and Goldstein cannot give an explanatory account of the patterns of sounds that occur in a language. Clements cannot say what people are doing when they are talking. Long may they all continue on their separate but equal paths. References Anderson, S. 1981. Why phonology isn't "natural." Linguistic Inquiry 12: 493-539. Browman, C. P. and L. Goldstein. 1985. Dynamic modeling of phonetic structure. In V. Fromkin (ed.) Phonetic Linguistics. New York: Academic Press. 1986. Towards an articulatory phonology. Phonology Yearbook 3. Chomsky, N. and M. Halle. 1968. The Sound Pattern of English. New York: Harper and Row. Ladefoged, P. 1971. Preliminaries to Linguistic Phonetics. Chicago: University of Chicago Press. 1980. What are linguistic sounds made of? Language 56: 485-502. 1985. A Course in Phonetics. New York: Harcourt Brace and Jovanovich. 404
Rubin, P., T. Baer, and P. Mermelstein. 1981. An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America 70: 321-328. Shadle, C. H. 1985. The acoustics of fricative consonants. Ph.D. thesis, M.I.T. Stevens, K. 1972. The quantal nature of speech: evidence for articulatory-acoustic data. In P. B. Denes and E. E. David Jr. (eds.) Human Communication. A Unified View. New York: McGraw-Hill. Stevens, K. and S. Blumstein. 1975. Quantal aspects of consonant production and perception: a study of retroflex consonants. Journal of Phonetics 3: 215-233. Stevens, K., S. J. Keyser, and H. Kawasaki. 1986. Toward a phonetic and phonological theory of redundant features. In J. S. Perkell and D. H. Klatt (eds.) Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence Erlbaum.
23 Articulatory binding
JOHN KINGSTON
23.1 The mismatch between phonological and phonetic representations
The division of the stream of speech into a string of discrete segments represents the number of places where contrast between the current speech event and others is possible. Discrete segments may be employed not only to represent phonological contrasts in speech events, however, but also in the plans for actually uttering them. Beyond isomorphism with phonological representations, the appropriateness of discrete segments at some level in the plans speakers employ to produce utterances is demonstrated by the fact that the vast majority of speech errors reorder entire segments rather than parts of segments or features (ShattuckHufnagel and Klatt 1979). This speech error evidence does not, however, reveal whether a segment-sized unit is employed at all levels in the plan for producing an utterance, nor what principles govern the coordination of articulators. Since more than one articulator is moving or otherwise active at any point in time in a speech event, the notion of a segment implies a specification for how these independent movements are coordinated. However, articulatory records of speech events cannot be divided into discrete intervals in which all movement of articulators for each segment begins and ends at the same time, nor can these records be obtained by a simple mapping from discrete underlying phonological segments. The continuous and coarticulatory properties of actual speech events instead make it very unlikely that the plans for producing them consist of nothing more than a string of discrete segments (Ohman 1966, 1967; Fowler 1980; Browman and Goldstein 1985, 1986; but cf. Henke 1966; Keating 1985). Moving from the phonetic to the phonological representation, the discrete phonological segment derived from the commutation of contrasting elements is at best covert. Discreteness has, moreover, gradually disappeared from phonological representations, in two widely-separated steps. The first step broke the segment into a bundle of distinctive features, a move which can be traced back at least to 406
Trubetzkoy (1939). Entire segments are still formally discrete, but they are no longer indivisible atoms. This step necessarily preceded breaking up the feature bundles themselves into linked tiers on which the domain of each feature is independently specified (Goldsmith 1976; Clements 1985; Sagey 1986). Discreteness survives in these models only in the timing units of the skeleta (Clements and Keyser 1983), and even the equation of the timing unit represented by the traditional segment with these timing units is presently in question (McCarthy and Prince 1986). Since the domain of feature specification in nonlinear phonological representations is both larger and smaller than the traditional segment, these representations remove any principled phonological reason to expect to find discrete segments in speech events. Discrete segments cannot be found in speech simply because they are not there in the phonological representation any more than in the phonetic one. Giving up the idea of the segment as a discrete element in phonological as well as phonetic representations requires some other principles for coordinating articulations, however, since the evidence for such coordination is patent. The principles of coordination which have been devised so far in nonlinear phonological models are largely formal in nature. They include one-to-one, leftto-right mapping (Goldsmith 1976; Clements and Ford 1979), the projection of P-bearing units (Clements and Sezer 1982), the no-crossing prohibition (Goldsmith 1976; Kenstowicz 1982; Sagey 1986), geminate integrity (Schein and Steriade 1986) or inalterability (Hayes 1986), the obligatory contour principle (Leben 1978, McCarthy 1986), and the shared feature convention (Steriade 1982). Parallel principles for coordinating articulations in the plans speakers employ in producing speech must also be discovered (see Browman and Goldstein 1985, 1986, this volume, for other proposals). This paper presents such a phonetic principle of coordination, which constrains when glottal articulations in consonants occur relative to oral ones. I refer to these constraints as "binding."
23.2 The binding principle

The binding principle is intended to account for two fundamental asymmetries in the distribution of glottal articulations (see Maddieson 1984):
1. Stops are much more likely to contrast for glottal articulations than either fricatives or sonorants, and
2. Glottal articulations in stops are much more frequently realized as modifications of the release of the oral closure than of its onset.
The principle attempts to account for these asymmetries in terms of two kinds of phonetic differences between stops and continuants. First, both the state of the glottis and degree of constriction downstream may
affect the pressure of the air inside the oral cavity, and thereby the acoustics of the sound produced. Changes in the size of the glottal aperture and the tension of the folds influence the resistance to air flow through the glottis, and therefore have the control of air flow through the glottis into the oral cavity as their proximate function. In obstruents, the distal function of the changes in glottal resistance which affect glottal air flow is to manipulate intraoral air pressure. The larger the glottal aperture and the lower the fold tension, the more glottal resistance will be reduced, the more air will flow through the glottis, and the more rapidly air pressure will rise in the oral cavity behind the obstruent articulation. This aerodynamic interaction between glottal and supraglottal articulations will be most dramatic in stops since the downstream obstruction of air flow is complete. The amount that intraoral air pressure is elevated behind the stop closure together with the size of the glottal aperture determine the acoustic character of the explosive burst of noise that occurs when the stop is released. Glottal aperture affects the acoustics of the burst both by the aerodynamic means just described and by determining what acoustic coupling there is between supra- and sub-glottal cavities. The acoustic character of the burst therefore at once depends on and cues the state of the glottis. Second, the release of a stop is acoustically distinct from its onset. A burst of noise is produced at the stop release as a result of the sudden opening of the oral cavity. Because the acoustic character of the burst reflects the size of the pressure buildup behind the obstruction (and acoustic coupling to the subglottal cavity), glottal articulations are expected to be coordinated with that part of the stop rather than its onset. No burst occurs at the release of a continuant's articulation because their obstruction of air flow is not complete. The release of a fricative or sonorant will therefore be the acoustic mirror image of its onset. With a less than complete obstruction, variations in glottal air flow will affect intraoral air pressure less, since the more open the articulation, the less air is trapped behind it. The goal of the glottal articulations which accompany continuant, especially approximant, articulations can therefore be only the proximate modification of the source.1 Fricatives are actually intermediate between stops and approximants, since their oral aperture is small enough to obstruct airflow,elevate intraoral air pressure, and accelerate flow through the oral constriction enough to create noisy turbulence. This elevation of intraoral air pressure depends upon a sufficiently large glottal aperture; voiceless fricatives exhibit the widest glottal aperture of any voiceless consonant, and even voiced fricatives are produced with a more open glottis than corresponding voiced stops or sonorants (Hirose and Gay 1972; Hirose, Lisker, and Abramson 1972; Collier, Lisker, Hirose, and Ushijima 1979). These larger glottal apertures would also alter the spectrum of the fricative noise by increasing the acoustic coupling between the supra- and subglottal cavities. How this coupling influences the acoustics and perception of fricatives cannot be gone into 408
in this paper, beyond noting that all the acoustic effects of the wide aperture will be distributed across much of the fricative interval. Such distributed acoustic effects of glottal articulations probably characterize continuants in general, in contrast to stops, where they are confined to the moment of release and the short interval immediately following it. In general terms, the acoustic effects of glottal articulations in stops are disjoint from and follow the acoustic effects of the oral articulations, while in continuants they are much more nearly simultaneous. The binding principle claims that a glottal articulation is more constrained - more tightly "bound" - in when it occurs the more the oral articulation obstructs the flow of air through the vocal tract: a glottal articulation is most tightly bound to a stop, since air flow is completely obstructed by the stop closure, while the lesser obstruction of a fricative or approximant allows the glottal articulation to shift with respect to the oral one. Since degree of obstruction is neither an articulatory nor an acoustic continuum - both the transition from approximation to constriction and from constriction to closure involve crossing of thresholds - the binding principle partitions segments into distinct manners. Usually just two are distinguished: either stops vs. fricatives and sonorants, i.e. noncontinuants vs. continuants, or stops and fricatives vs. sonorants, i.e. obstruents vs. approximants. The dividing line appears to fall most often between stops and continuants, despite fricatives' dependence for noise production on an adequate glottal air flow. The binding principle predicts that the timing of oral articulations determines when glottal articulations will occur and that timing depends on the continuancy of the oral articulation. The onsets and offsets of glottal articulations will be more or less synchronized with the onsets and offsets of oral articulations in continuants, but not in stops. In the latter, they will align with the release so as to modify it and the immediately following interval acoustically. Alignment of glottal articulations in stops with the release is apparent in voice onset time (VOT) contrasts, in which adduction of the glottis is timed with respect to the release of the stop. Glottal articulations are also aligned with the stop release in at least two other kinds of stops, breathy voiced stops and ejectives, which do not participate in VOT contrasts. Partial abduction begins with or just before the release of the stop in breathy voiced stops, and tight glottal closure combines with reduction of oral volume to produce an intense burst at the release of an ejective. This alignment again has the effect of concentrating the acoustic effects of the glottal articulations in the immediate vicinity of the release. The next section of this paper addresses the problems that pre-aspirated stops of the sort found in Icelandic present to the claim that glottal articulations bind to a stop's release rather than its closure. It is shown that both in the phonology and phonetics of Icelandic, pre-aspirates are not after all a problem for this aspect of the binding principle. The final part of the paper explores a number of further problems with the binding principle and proposes tentative resolutions of them.
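Before turning to the pre-aspirates, the aerodynamic reasoning behind these claims can be restated in a deliberately idealized form (a textbook-style approximation, not a model developed in this paper), with $U_g$ the volume velocity through the glottis, $A_g$ the glottal area, $P_s$ and $P_o$ the subglottal and intraoral pressures, $\rho$ the density of air, and $V_o$ the volume of the cavity behind the oral obstruction (assumed rigid):

U_g \approx A_g \sqrt{\frac{2\,(P_s - P_o)}{\rho}}, \qquad \frac{dP_o}{dt} \approx \frac{P_{\mathrm{atm}}}{V_o}\, U_g .

On these approximations a wider glottis (larger $A_g$) admits more flow and so drives $P_o$ up faster; behind a complete closure $P_o$ approaches $P_s$ and chokes off the flow, and the pressure remaining at release, together with the degree of glottal opening, shapes the burst, whereas behind a leaky continuant constriction $P_o$ rises far less.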
23.3 The problem of pre-aspirates
23.3.1 Introduction
23.3.1.1 THE ORAL-GLOTTAL SCHEDULE IN PRE-ASPIRATED STOPS
Icelandic has a class of stops which are traditionally called "pre-aspirated," since the glottis opens before the oral closure is made, producing a preceding interval of noise. Pre-aspirates are markedly rarer in the world's languages, at least as a contrasting type, than the common post-aspirates. Contrary to the prediction of the binding principle, this early abduction of the glottis may be bound to the onset of the oral closure, in contrast to the post-aspirated stops in which abduction is bound to the release of the closure. Since Icelandic also has post-aspirated stops, a contrast appears to exist in this language between stops whose glottal abduction binds to the stop closure and those where it binds to the release. In Icelandic pre-aspirated stops (Pétursson 1972, 1976; Thráinsson 1978; Löfqvist and Yoshioka 1981b; Chasaide p.c.), the glottis begins to open during the preceding vowel, partially devoicing it. Pre-aspiration generally has the timbre of that vowel. There is some uncertainty about when peak glottal opening is reached and when the folds are again adducted, however. Löfqvist and Yoshioka's data indicate a small glottal opening that peaks before the stop closure, with adduction coinciding with the beginning of the stop closure, while both Pétursson's and Chasaide's data show a much larger peak opening that is aligned to the closure, with adduction not occurring until the stop is released or even somewhat after. Pre-aspirated stops in Löfqvist and Yoshioka's data invert the order of glottal and oral articulations found in breathy voiced stops such as occur in Hindi, in which the glottal opening (largely) follows the release of the oral closure (Kagaya and Hirose 1975; Benguerel and Bhatia 1980), while the pre-aspirates observed by Pétursson and Chasaide are the mirror image - relative to the stop closure and release - of post-aspirates, in which the glottis begins to open at the onset of the oral closure, reaches peak opening at release, and is only adducted noticeably later. In either case, the timing of glottal abduction relative to the oral closure in pre-aspirates contrasts with that found in another stop type in which the opening of the glottis is likely to be bound to the release of the closure, either breathy voiced or post-aspirated stops.
23.3.1.2 THE PHONOLOGY OF PRE-ASPIRATION IN ICELANDIC
Pre-aspiration in Icelandic (Garnes 1976; Thráinsson 1978), and perhaps in the other languages in which it is found, arises only from underlying clusters and not from single segments. On the surface, pre-aspirated stops contrast with unaspirated geminate or single stops between vowels and with post-aspirated stops, but before nasals or laterals the contrast is only between pre-aspirates and
single unaspirated stops. The syllable boundary falls after the stop in the latter sort of clusters. Even when they occur alone, Icelandic pre-aspirates all retain one quite overt property of surface heterosyllabic clusters: only short vowels may precede them, which also indicates that they close the preceding syllable. On the other hand, vowels are long before post-aspirates in this language, whether occurring alone or followed by a glide or rhotic. Post-aspirates and clusters beginning with them must therefore belong only to the following syllable. In all dialects, pre-aspirates arise from underlying geminate aspirates, but the sources of pre-aspiration or a close analogue of it are by no means restricted to those. In Southern Icelandic dialects, underlying clusters of a voiced fricative, nasal, lateral, or rhotic preceding an underlying aspirated stop are typically realized with the first segment voiceless and the second unaspirated. In Northern dialects, a more restricted set of consonants undergoes devoicing before underlying aspirates. In both sets of dialects, when the first consonant devoices, the second is underlyingly aspirated, i.e. [+spread], and when the first consonant remains voiced on the surface, the following stop is invariably post-aspirated. Since devoicing of the first segment co-occurs with an obligatory absence of post-aspiration on the stop, it represents the shift of the [+spread] specification to the first segment from the stop. Furthermore, since the stop is post-aspirated only when the first segment in the cluster does not devoice, it is possible to claim that the one or the other but not both segments in the cluster may be realized with [+spread] glottis. In a parallel way in most dialects, the first stop in heterorganic clusters of stops where the second is underlyingly aspirated is realized as the corresponding voiceless fricative, and again, if the first stop spirantizes, the second may not be post-aspirated. Since voiceless fricatives demand a very wide glottal aperture to elevate air flow through the glottis sufficiently to produce turbulence downstream through the oral constriction, this realization can also be seen as a shift of [+spread] to the first element of the cluster. Finally, as in many other languages, Icelandic stops do not contrast for aspiration after /s/, a segment which demands at least as wide a glottal aperture as the other voiceless fricatives. In general terms, not only may just one segment bear a [+spread] specification in surface forms, but only one segment in the cluster needs to (or can) be specified as [+spread] in underlying forms. In all these cases, underlying specification of the second member of the cluster as [+spread] is phonetically realized as a property of the first rather than the second element of the cluster by a quite general procedure in the language. In all dialects, clusters in which the final stop is underlyingly [+spread] contrast with others in which it is not. In the latter sort of cluster, the preceding consonant remains voiced uniformly. Since it arises only in clusters whose second member is [+spread], pre-aspiration cannot contrast directly with post-aspiration. The latter arises only when an underlying [+spread] stop occurs in absolute initial position in a word or
directly after a vowel. Generalizing, pre-aspiration arises in syllable codas in heterosyllabic clusters whose second member is [+spread], as a result of a shift of this specification to the preceding coda. On the other hand, post-aspiration is only found in singleton [+spread] stops or those which are the first member of a tautosyllabic cluster, i.e. in onsets rather than codas. Formulaically, /aC.pʰa/ is realized as [aCʰ.pa], while /aa.pʰ(C)a/ undergoes no change. The failure of pre-aspirates to ever contrast directly with post-aspirates within morphemes in Icelandic eliminates them as a problem for the binding principle, at least insofar as the phonology of the language is concerned. (The analysis outlined above is a generalization of Thráinsson's (1978) analysis. I am indebted to Donca Steriade for making the significance of devoicing in the initial segments of clusters ending in [+spread] stops clear to me.)
23.3.2 Pre-aspirates as a test of the binding principle in the phonetic component: the covarying durations of glottal abduction and flanking oral articulations
23.3.2.1 INTRODUCTION
An experiment was designed to determine whether the abduction of the glottis is still after all part of the same articulatory unit as the following stop closure in pre-aspirated stops, since there is no a priori reason why the units of articulation should be identical to underlying ones. If pre-aspiration is coordinated with the following stop rather than the preceding vowel, then glottal articulations must be allowed to bind phonetically to the closure as well as the release in stops, despite the absence of any event as salient as the release burst at the beginning of the closure. This experiment assumes that if two articulations are coordinated with one another, their individual durations will covary across global changes in segment duration, due to changes in rate or prosodic context, i.e. coordinated articulations should exhibit relational invariance (Tuller and Kelso 1984). Specifically, if abduction of the glottis is coordinated with an adjacent or overlapping oral articulation, then the duration of abduction should covary with that of the oral articulation. If the binding principle is correct, then the duration of an abduction which overlaps the release of a stop closure, as in post-aspirated stops, should correlate positively with the duration of the closure, but no necessary correlation should be observed when abduction does not overlap the release, as in pre-aspirated stops. Pre-aspiration may instead be coordinated with the preceding vowel and its duration might covary with the duration of that segment. Alternatively, glottal abduction in pre-aspirated stops may be coordinated with no oral articulation. The binding principle does not distinguish between the two possibilities.
23.3.2.2 METHODS
A single female speaker of Icelandic was recorded producing words containing medial pre- and post-aspirated stops under conditions that would change segment durations. Since the binding principle predicts coordination of glottal abduction with the preceding stop, specifically its burst, in post-aspirated stops, covariation of the durations of oral and glottal articulations in them can be compared with that in pre-aspirated stops. Medial breathy-voiced and post-aspirated stops were also recorded from a single female speaker of Hindi under similar conditions. Since abduction overlaps or follows the stop release in both breathy voiced and post-aspirated stops, these Hindi data provide an additional test of the positive prediction of the binding principle that abduction will be coordinated with a release burst that it overlaps temporally. All tokens in both languages were produced medially in short frame sentences, with the sentences read in a different, random order for each repetition. The Icelandic speaker produced a variety of real words containing pre- and post-aspirated stops between the first, stressed syllable and the second. (Lexical stress occurs obligatorily on the first syllable in Icelandic words.) The Hindi speaker produced nonsense words of the shape [a_a] or [a_apa] with the stops in the blanks. In alternate readings, the Icelandic speaker put focus on the test word itself or on the immediately preceding word. Segments are expected to be longer in words in focus than in those not in focus, particularly in the vicinity of the stressed syllable. Final lengthening would lead one to expect longer segments in the Hindi words where only one syllable follows the stop than when two do. Both of these manipulations, though they are quite different from one another, should therefore affect the duration of the stops and their component articulations. In addition, for both languages, each way of reading the words was produced at self-selected moderate and fast rates. This orthogonal variation in rate will also affect segment duration, though more globally than focus or the number of following syllables. Multiple repetitions of each utterance type were collected from each speaker under each condition. The utterances were digitized at 10 kHz and measurements made from the waveform displays, rather than articulatory records, to the closest millisecond. Three intervals were measured in each case:
1. The duration of audible glottal abduction,
2. The duration of the flanking vowel plus the duration of audible glottal abduction, and
3. The duration of the flanking oral closure plus the duration of audible glottal abduction.
The interval of glottal abduction was taken to be the stretch of noise between the
vowel and oral closure, except for pre-aspiration in Icelandic, where this interval included a breathy interval at the end of the preceding vowel as well as a following interval of noise before the stop closure. This interval is that portion of the actual abduction of the glottis which is audible; since in all cases the flanking stop closure overlaps some part of the abduction, the interval measured is only part of the total duration of the abductory gesture. The vowel was identified as that stretch where the waveform was intense, clearly periodic, and with the asymmetric periods characteristic of modal voice. The oral closure was identified as that interval of greatly reduced amplitude or silence between vowels. The question immediately arises whether acoustic records can be used in this way to measure the duration of articulatory events. This question is especially pertinent here since that part of glottal abduction which overlaps with the oral closure cannot be observed directly in an acoustic record. However, I have assumed that changes in the duration of the observable part of the glottal abduction outside the interval of the oral closure will be in proportion to changes in its total duration. If this is true, measuring the observable part would reveal the same relative changes as measuring the entire interval of abduction.
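For concreteness, the three measurements reduce to simple differences between labeled landmark times. The sketch below (Python, with invented landmark names and example times in milliseconds; it is not the measurement software actually used) makes the arithmetic explicit:

# Minimal sketch: the three measured intervals as differences between
# hand-labeled acoustic landmarks (times in ms). Names are hypothetical.
def measured_intervals(vowel_on, vowel_off, noise_off, closure_off):
    # vowel_on..vowel_off    : intense, clearly periodic (modal) vowel
    # vowel_off..noise_off   : audible glottal abduction (breathy and/or noisy)
    # noise_off..closure_off : oral closure (greatly reduced amplitude or silence)
    abduction = noise_off - vowel_off                                # interval 1
    vowel_plus_abduction = (vowel_off - vowel_on) + abduction        # interval 2
    closure_plus_abduction = (closure_off - noise_off) + abduction   # interval 3
    return abduction, vowel_plus_abduction, closure_plus_abduction

print(measured_intervals(0.0, 110.0, 175.0, 260.0))  # (65.0, 175.0, 150.0)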
23.3.2.3 ANALYSIS
Correlations were calculated between the duration of audible glottal abduction and the acoustic durations of each of the flanking oral articulations combined with the duration of audible glottal abduction for each of the two stop types in the two languages. Correlations tend to be positive, often quite strongly so, when an interval is correlated with a larger interval which includes itself, but the portion of the obtained correlation which is due simply to this inclusion can be calculated and the significance of the difference between these obtained and expected correlations determined (Cohen and Cohen 1983; Munhall 1985). If the obtained correlation is significantly larger than the expected value, then the durations of the two articulations can be assumed to be positively correlated, while if the obtained correlation is significantly smaller, then the two intervals are negatively correlated. Correlations that are more positive than expected are taken here to indicate coordination between the two articulations. These part-whole correlations avoid the negative bias which arises from errors locating the boundary when correlating the durations of adjacent intervals (Ohala and Lyberg 1976).
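A minimal sketch of this style of analysis is given below. It assumes the textbook formula for the correlation expected from part-whole inclusion alone and a Fisher r-to-z comparison of observed against expected values; the paper does not spell out its exact computational recipe, so both choices are assumptions here, and the durations in the example are fabricated.

# Sketch of a part-whole correlation test: is the observed correlation
# between a part (abduction) and a whole (abduction + closure) more
# positive than inclusion alone would produce?
import numpy as np
from scipy.stats import norm

def part_whole_test(part, remainder):
    part, remainder = np.asarray(part, float), np.asarray(remainder, float)
    whole = part + remainder
    n = len(part)
    r_obs = np.corrcoef(part, whole)[0, 1]
    # Expected correlation if part and remainder were uncorrelated:
    # corr(X, X + Z) = sd(X) / sqrt(var(X) + var(Z)) when cov(X, Z) = 0.
    r_exp = part.std(ddof=1) / np.sqrt(part.var(ddof=1) + remainder.var(ddof=1))
    # One assumed way to test the difference: Fisher's r-to-z transform.
    z = (np.arctanh(r_obs) - np.arctanh(r_exp)) * np.sqrt(n - 3)
    p = 2 * (1 - norm.cdf(abs(z)))
    return r_obs, r_exp, p

# Example with fabricated durations (ms), 187 tokens as in the pooled data:
rng = np.random.default_rng(0)
abduction = rng.normal(60, 15, 187)
closure = rng.normal(120, 25, 187)
print(part_whole_test(abduction, closure))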
23.3.2.4 RESULTS
The obtained and expected correlations with indications of significant differences, if any, pooled across conditions (as plotted in the figures) are listed in table 23.1 below, for the two different rates in table 23.2, and for the two different focus positions (Icelandic) or two different numbers of following syllables (Hindi) in table 23.3:
Table 23.1 Part-whole correlation between glottal abduction and adjacent closure or vowel duration: observed (expected), all data pooled

              Icelandic                                 Hindi
              Pre-aspirated      Post-aspirated      Breathy voice      Post-aspirated
n             187                78                  48                 48
AxC           0.800 > (0.598)    0.616 = (0.619)     0.827 = (0.735)    0.797 > (0.634)
AxV           0.917 > (0.848)    0.728 > (0.381)     0.383 = (0.572)    0.670 = (0.708)

n is the number of tokens of each type; different n's are given for all the tokens of each stop type pooled and for each condition separately. "AxC" indicates the correlation between abduction and adjacent closure, "AxV" that between abduction and adjacent vowel. ">" indicates the observed correlation is significantly (p < 0.05 or better) positive compared to the expected value, "<" that it is significantly negative, and "=" that there is no significant difference. The same conventions are used in all the other tables.

Table 23.2 Part-whole correlation between glottal abduction and adjacent closure or vowel duration: observed (expected), condition: rate
              Icelandic                                 Hindi
              Pre-aspirated      Post-aspirated      Breathy voice      Post-aspirated
Slow:  n      94                 39                  24                 24
       AxC    0.674 = (0.561)    0.554 = (0.663)     0.770 = (0.825)    0.851 = (0.833)
       AxV    0.891 > (0.806)    0.762 > (0.394)     0.068 < (0.569)    0.684 = (0.763)
Fast:  n      93                 39                  24                 24
       AxC    0.705 = (0.684)    0.678 = (0.748)     0.817 = (0.810)    0.760 = (0.746)
       AxV    0.855 = (0.812)    0.632 = (0.544)     0.337 = (0.543)    0.521 = (0.719)
Table 23.3 Part-whole correlation between glottal abduction and adjacent closure or vowel duration: observed (expected), condition: focus (Icelandic) and number of syllables (Hindi)

Icelandic                            Pre-aspirated      Post-aspirated
Focus:  In focus:       n            93                 40
                        AxC          0.821 > (0.627)    0.574 = (0.597)
                        AxV          0.942 = (0.921)    0.696 > (0.376)
        Not in focus:   n            94                 38
                        AxC          0.710 > (0.544)    0.641 = (0.692)
                        AxV          0.871 = (0.827)    0.741 > (0.396)

Hindi                                Breathy voice      Post-aspirated
Number of syllables:
        a_a:            n            24                 24
                        AxC          0.821 = (0.652)    0.850 = (0.689)
                        AxV          0.301 = (0.424)    0.745 = (0.747)
        a_apa:          n            24                 24
                        AxC          0.845 = (0.840)    0.730 = (0.658)
                        AxV          0.784 > (0.531)    0.732 = (0.854)
23.3.2.5 DISCUSSION
The binding principle predicts no positive correlation between the durations of glottal abduction and the following stop closure in Icelandic pre-aspirates but just such a positive correlation between the durations of abduction and the preceding closure for the other three stop types: Hindi breathy voiced and Hindi and Icelandic post-aspirated stops. Nearly complementary predictions are made regarding correlation between the durations of abduction and the adjacent vowel; no positive correlation is expected for stop types other than the Icelandic pre-aspirates, which may exhibit such a correlation as a result of coordination between the opening of the glottis and the articulation of the preceding vowel. Alternatively, abduction of the glottis in Icelandic pre-aspirates may not be coordinated with either of the two flanking oral articulations. The binding principle does not require that glottal articulations be coordinated with oral ones, but instead predicts how they will be coordinated if they are. As table 23.1 shows, for all the data pooled, correlations between pre-aspiration and both flanking oral articulations were significantly more positive than expected. The observed correlation of 0.800 between abduction and the following closure is significantly more positive than the expected 0.598, and the observed correlation between abduction and the preceding vowel is also significantly more positive than expected: 0.917 vs. 0.848. Significantly positive correlations are also observed between abduction and the following vowel in Icelandic post-aspirated stops, but not between abduction and the preceding closure. In post-aspirated stops in Hindi, however, the correlation between abduction and the preceding closure is significantly more positive than expected. The Hindi breathy voiced stops exhibit no significant correlation between abduction and either the preceding closure or the following vowel. Even so, the difference between the observed and expected correlations for abduction with preceding closure in Hindi breathy voiced stops is in the same direction as for the Hindi post-aspirated stops. The correlations between abduction and the following vowel, on the other hand, are in the opposite, negative direction, though for neither type of stop are they significant. Tables 23.2 and 23.3 show that, with one exception, the differences between observed and expected correlations in each condition separately do not differ in direction from those obtained when all the data were pooled. The exception is the significantly positive correlation between abduction and the following vowel in Hindi breathy voiced stops when followed by two syllables (table 23.3), which is otherwise consistently negative with respect to the expected correlation. The same pattern of significant differences was not always found, however, in each of the conditions as when all the data were pooled. Though in the same direction as in other conditions, no significant differences between observed and expected correlations were obtained for any of the stops at the fast rate; elsewhere, however, at least some of the differences are significant. One
consistent and troublesome result is that no significant difference in either direction was found between abduction and the preceding closure for Icelandic post-aspirated stops in any of the experimental conditions. This failure is especially worrisome since post-aspirated stops in Icelandic do show significantly positive correlations between abduction and the following vowel in three out of four conditions, and in the fourth the difference between observed and expected correlations is in the same direction. Also, the near absence of significantly positive correlations for both of the Hindi stops in any of the conditions reveals generally weak covariation between the duration of abduction and that of either of the flanking oral articulations in this language. Finally, the significantly positive correlation between abduction and the preceding closure obtained for the Hindi post-aspirated stops when all the data were pooled is not found in any of the conditions taken separately. If all correlations which differ from the expected value in one direction are considered, not just those which are significant, then the binding principle fails in two ways in Icelandic:
1. Pre-aspiration correlates positively with the following closure, but
2. Post-aspiration does not correlate positively with the preceding closure.
This pattern of results is exactly the opposite of what was predicted by the binding principle. For both pre- and post-aspirated stops, abduction correlates positively with the flanking vowel, which is neither predicted nor excluded by the binding principle. The Hindi data weakly support the binding principle, on the other hand; both breathy voice and post-aspiration correlate positively with the preceding closure but not with the following vowel. What these data reveal is an essential difference between the two languages, rather than between stops which differ in the order of the glottal and oral articulations. By this measure in Icelandic, glottal abduction appears to be coordinated with the adjacent vowel and for pre-aspiration also with the following oral closure, while in Hindi, glottal abduction is coordinated with the preceding oral closure but not the following vowel. The relatively small number of significant correlations suggests one of two things: either this method of evaluating coordination is too coarse, perhaps because abduction is only measured partially, or coordination between glottal and oral articulations, where it does exist, does not exhibit strong relational invariance. The most troubling results are the double failure of the binding principle in Icelandic: the absence of the predicted positive correlation of abduction with the preceding closure in post-aspirated stops and the presence of the excluded positive correlation of abduction with the following closure in the pre-aspirated stops. In the next section, an alternative explanation will be offered for the latter problematic result. The failure of the binding principle with respect to the
Icelandic post-aspirates remains genuine, however, unless it can be shown that relational invariance is not an appropriate measure of coordination.
23.3.3 Breathiness and noise: the two components of pre-aspiration in Icelandic
23.3.3.1 INTRODUCTION AND METHODS
A closer look at pre-aspiration reveals that the coordination with the following oral closure is illusory. As illustrated by the waveform of the transition from a vowel to a pre-aspirated stop (in the word þykkur "thick") in figure 23.1a, pre-aspiration has two components: a breathy interval followed by an interval of noise. The breathy interval was identified as that where the shape of a glottal period changed from asymmetric to sinusoidal, while noise was simply that interval where the waveform no longer appeared periodic; both intervals could be identified consistently by eye. Breathy voicing is a product of partial abduction of the folds, which increases air flow through the glottis sufficiently to produce noise and also increases the negative slope of the spectrum of the glottal wave. Compared to modal voice, in which the higher harmonics are more intense than the first, in breathy voice a shortening of the closed phase of the glottal cycle relative to the open phase produces a source spectrum in which the first harmonic is more intense than the higher harmonics (Bickley 1982; Ladefoged 1982). The more intense first harmonic produces the characteristic nearly sinusoidal shape of the waveform. This initial partial abduction is not great enough to extinguish voicing, however. That only happens somewhat later, when the vocal folds become too far apart to vibrate any longer and only noise is produced. Vocal fold vibration thus ceases substantially before the beginning of the following oral closure. Of the two components of pre-aspiration, Chasaide (1986) has argued that it is the breathy interval and not the noise which is the most salient cue to identifying a stop as pre-aspirated in Icelandic and Scots Gaelic. The waveform in figure 23.1a is for a token in focus spoken at a moderate rate, i.e. under conditions when segment durations are expected to be longest. Figure 23.1b is the waveform for an out-of-focus token of the same word spoken quickly, i.e. one in which segment durations are expected to be shortest. This comparison shows that the duration of the noisy component is reduced much more drastically than the duration of the breathy component, a consistent feature of these data. The difference between these two waveforms is in fact characteristic of the effects of varying rate or focus location; the breathy component remains essentially inert while the noisy component changes in proportion to changes in other segment durations. Changes in the durations of the two components of pre-aspiration were examined quantitatively in the data collected from the Icelandic speaker, and the changes in their individual durations compared to changes in the total duration of pre-aspiration, the duration of the preceding vowel, and the following stop closure, again via part-whole correlations.
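The two components were delimited by eye, but the spectral property just described - a first harmonic that dominates the higher harmonics - suggests a crude automated analogue. The sketch below is only a hypothetical illustration of that criterion (it was not part of the original measurement procedure) and assumes an f0 estimate is already available:

# Crude breathiness check for one windowed frame: compare the level of the
# first harmonic (H1) with the second (H2); a large positive H1 - H2 in dB
# is consistent with a breathy, near-sinusoidal waveform.
import numpy as np

def h1_minus_h2(frame, fs, f0):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    def level_db(target_hz, tol=0.25):
        band = (freqs > target_hz * (1 - tol)) & (freqs < target_hz * (1 + tol))
        return 20 * np.log10(spec[band].max() + 1e-12)

    return level_db(f0) - level_db(2 * f0)

# Synthetic example at the 10 kHz sampling rate used for the recordings:
fs, f0 = 10000, 200
t = np.arange(0, 0.03, 1.0 / fs)
modal = np.sin(2 * np.pi * f0 * t) + 0.8 * np.sin(2 * np.pi * 2 * f0 * t)
breathy = np.sin(2 * np.pi * f0 * t) + 0.1 * np.sin(2 * np.pi * 2 * f0 * t)
print(h1_minus_h2(modal, fs, f0), h1_minus_h2(breathy, fs, f0))  # small vs. large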
Figure 23.1 Waveforms of the transition from the vowel to a following pre-aspirated stop in þykkur "thick": (a) when spoken in focus at a slow rate, and (b) out of focus at a fast rate. Note the marked reduction of the duration of the noisy component, compared to the inertness of the breathy component, from (a) to (b). (Each panel plots the waveform against time in ms, with the oral closure marked.)
23.3.3.2 RESULTS
As noted, the breathy component varies in duration much less than does the noisy component across the experimental conditions. The range of variation for the breathy component is only 40 ms. (10-50 ms.), while the noisy component varies across a range of more than 90 ms. (10-100 ms.). The noisy component itself varies in the expected direction, being markedly shorter at fast than slow rates and when the word is not in focus than when it is. The results of correlating the durations of the breathy and noisy components with each of these intervals are compared with expected values in tables 23.4-23.6. These are again part-whole correlations. The duration of the breathy component correlates negatively with the total duration of pre-aspiration (table 23.4) when all the data are pooled and when correlations are calculated for each condition separately, except in the out-of-focus condition. The observed correlation for the noisy component, on the other hand, is significantly positive overall and in the two conditions where longer durations are expected (nonsignificant correlations differ in the same direction for both components). The correlation between the breathy component and the preceding vowel (table 23.5) is also significantly negative overall and at both rates. The observed correlation is less than expected in both focus conditions, too, even though these differences are not significant. The correlation between the noisy component and the preceding vowel is not significantly different from the expected value overall, or in any condition. Finally, the breathy component exhibits a significant negative correlation with the following stop closure (table 23.6) overall and at the fast rate, but in the out-of-focus condition the correlation is significantly more positive than expected (nonsignificant correlations are in the negative direction). The correlation of the noisy component is significantly positive overall and when the word was in focus; otherwise, the observed correlations are not significantly different from the expected values (here, no consistent direction is evident in the nonsignificant correlations).
Table 23.4 Part-whole correlations between breathy and noisy components of pre-aspiration and total duration of pre-aspiration: observed (expected)

                     n      Breathiness            Noise
Overall:             188    -0.202 < (0.247)       0.929 > (0.878)
Rate:   Slow:        94     -0.471 < (0.298)       0.922 > (0.827)
        Fast:        94     -0.108 < (0.318)       0.884 = (0.855)
Focus:  Yes:         93     -0.018 < (0.217)       0.963 > (0.900)
        No:          95      0.339 = (0.365)       0.780 = (0.721)
Table 23.5 Part-whole correlations between breathy and noisy components of pre-aspiration and duration of preceding vowel: observed (expected)

                     n      Breathiness            Noise
Overall:             188    -0.004 < (0.370)       0.915 = (0.929)
Rate:   Slow:        93     -0.230 < (0.404)       0.898 = (0.915)
        Fast:        95      0.082 < (0.423)       0.885 = (0.906)
Focus:  Yes:         94      0.127 = (0.273)       0.958 = (0.962)
        No:          94      0.446 = (0.487)       0.861 = (0.874)
Table 23.6 Part-whole correlations between breathy and noisy components of pre-aspiration and duration of following closure: observed (expected)

                     n      Breathiness            Noise
Overall:             187     0.047 < (0.215)       0.706 > (0.614)
Rate:   Slow:        94      0.077 = (0.251)       0.579 = (0.605)
        Fast:        93      0.000 < (0.299)       0.624 = (0.669)
Focus:  Yes:         93      0.087 = (0.147)       0.796 > (0.618)
        No:          94      0.459 > (0.240)       0.527 = (0.432)
23.3.3.3 DISCUSSION
Since the part-whole correlations between the noisy component and pre-aspiration were in some cases significantly positive, the noisy component can be identified as the single variable component of the measured duration of pre-aspiration. On the other hand, the breathy component must be independent of pre-aspiration, since its duration correlates negatively with that of pre-aspiration. With respect to the flanking oral articulations, a similar pattern emerges; the breathy component tends to correlate negatively with each of them, while the noisy component either exhibits no correlation or a positive correlation. More generally, the duration of the breathy component stays the same or increases slightly as the other articulations get shorter, while the duration of the noisy component varies proportionally in the same direction as the other articulations. These facts allow a different interpretation of the apparent positive correlation between the duration of pre-aspiration and that of the following closure, which removes the problem these Icelandic data presented to the binding principle.
Figure 23.2 Presumed timing of oral and glottal articulations in an Icelandic pre-aspirated stop: (a) the breathy component is produced by a small glottal abduction, the noisy component by a much larger abduction; (b) increasing speaking rate or being out of focus leads to earlier oral closure, and hence greater overlap of the oral closure with the interval of large glottal abduction. (Both panels plot the glottal and oral apertures against time; in (b) the earlier closure truncates the noise.)
Figure 23.2a represents the timing of the oral and glottal articulations which will produce the waveform in figure 23.1a (cf. Pétursson 1976). As sketched in figure 23.2b, at faster rates or when the word is not in focus (cf. figure 23.1b), the interval between the two successive oral articulations is shortened; the closure occurs
earlier than at slower rates or in focus. This earlier closure overlaps more of the interval of wide glottal abduction which produces the noisy component of pre-aspiration and thereby shortens the interval during which it is audible. Since this noisy interval is the principal variable component of pre-aspiration, this rate- or focus-induced variation in the amount of overlap produces the observed positive correlation between pre-aspiration and the following closure. But, since the oral articulation is simply sliding with respect to the glottal one, that positive correlation does not indicate coordination between the two after all. Instead, the timing of the glottal articulation may remain more or less unchanged as the interval between the oral articulations varies, with the closure overlapping more or less of the interval of wide glottal aperture which produces the noisy component of pre-aspiration. Slight lengthening of the breathy component at the same time that the noisy component is being shortened indicates, however, that the schedule of the glottal articulation is not entirely invariant. At faster rates or out of focus, the small abduction which produces breathiness occurs earlier with respect to the end of the vowel articulation and the transition to the wider abduction which produces noise alone takes longer. What is significant here is that the timing of the glottal articulation is being adjusted with respect to the preceding vowel, though it apparently is not adjusted with respect to the following consonant. The complementary variation between the two components of pre-aspiration suggests that they may trade off perceptually in conveying the fact that the stop is pre-aspirated. With the longer breathy interval, pre-aspiration is anticipated to a greater extent in shorter vowels, perhaps to ensure that some minimum interval of non-modal phonation occurs, while with longer vowels, the upcoming oral closure is also later, so a longer noisy component will occur. These data from production thus do not conflict entirely with Chasaide's claim that breathiness is more important than noise for identifying a pre-aspirated stop. They instead suggest that the relative importance of the two components, as measured by their relative durations, varies in a complementary way with the acoustic duration of flanking oral articulations. This account of Icelandic pre-aspirates supports the suggestion of Browman and Goldstein (this volume) that apparent assimilations and deletions which occur in casual speech arise simply from an acoustically obscuring overlap of the movement of one articulator - the apparently assimilated or omitted articulation - by the sliding movement of another over it, rather than being a product of actually substituting or omitting articulatory gestures. In their view, the shift from a careful to a casual speech style changes the relative timing of articulatory gestures but not their shape or occurrence. It is not at all surprising that this suggestion should generalize to the effects of varying rate and focus location in these Icelandic data, since the differences between casual and careful speech undoubtedly parallel those between fast and moderate rates of speaking or between a word out of focus and one in focus.
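The logic of figure 23.2 can be made concrete with a toy simulation (all timing values below are invented for illustration and are not fitted to the Icelandic measurements): hold the glottal schedule fixed relative to the vowel, let the oral closure start earlier and last a shorter time at faster rates, and the audible noise interval and the closure duration covary positively even though no coordination between them has been built in.

# Toy simulation of the overlap account in figure 23.2. The glottal schedule
# is fixed; only the oral closure slides and shortens with rate. All values
# are invented.
import numpy as np

rng = np.random.default_rng(1)
rate = rng.uniform(0.6, 1.0, 200)              # 1.0 = slow/in focus, 0.6 = fast

breathy = np.full(200, 30.0)                   # ms; nearly inert component
noise_onset = 100.0                            # voicing extinguished (fixed schedule)
closure_onset = 160.0 * rate + rng.normal(0, 5, 200)  # closure starts earlier when fast
closure_dur = 90.0 * rate + rng.normal(0, 5, 200)     # closure also shortens when fast

audible_noise = np.clip(closure_onset - noise_onset, 0, None)  # closure truncates the noise
preaspiration = breathy + audible_noise

# Part-whole correlation of pre-aspiration with (pre-aspiration + closure):
print(np.corrcoef(preaspiration, preaspiration + closure_dur)[0, 1])
# Strongly positive, although the glottal gesture is not timed from the closure.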
23.3.4 Summary
The Icelandic data appeared at first to cast serious doubt on the binding principle, in two ways. First, pre-aspiration correlated positively with the following stop closure, and, second, post-aspiration did not correlate with the preceding stop closure. In both kinds of stops, furthermore, the duration of audible abduction of the glottis correlated positively with the adjacent vowel. However, as closer examination of the variation of duration in the two components of pre-aspiration has shown, there is no actual coordination between glottal abduction and the following oral closure in Icelandic pre-aspirates, at least as reflected in relational invariance. Instead, the relative times of the two flanking oral articulations vary with respect to one another, occurring closer together at faster rates or out of focus. This variation changes the duration of the audible interval of noise in pre-aspiration without affecting the duration of the breathy component proportionally, because the oral closure overlaps more or less of the interval of the wide glottal abduction that produces the noisy component. Since the noisy component is the principal variable component of pre-aspiration, this variation in the amount of overlap of glottal abduction by the oral closure leads to an apparent covariation between the durations of pre-aspiration and the following oral closure, suggesting coordination of the two articulations. The more correct view is that the glottal articulation is not coordinated at all with the following oral closure. Its coordination with the preceding vowel, on the other hand, tends towards invariance, though there is evidence of a small deceleration of glottal abduction at fast rates and out of focus. The Icelandic pre-aspirates do not therefore pose a problem for the binding principle in the phonetics any more than in the phonology. The Hindi data turned out much closer to the predictions of the binding principle in that in both breathy voiced and post-aspirated stops, glottal abduction only correlates positively with the preceding stop closure. The post-aspirates of Icelandic remain problematic, since they failed to show even the slightest sign of covariation in duration between glottal abduction and the preceding oral closure of the sort predicted by the binding principle.
23.4 Binding sites: predictions of the binding principle and problems
The single ballistic opening of the glottis which occurs in the production of single voiceless consonants could bind either to the beginning or the end of the oral constriction. In fricatives, the beginning and end of the oral constriction are acoustically more symmetric - and the glottal abduction is simply intended to produce a sufficient and continuous flow of air - so the binding principle can choose one or the other as the favored binding site. In stops, on the other hand, these sites are asymmetric: the dramatic reduction in amplitude at the stop closure is quite
different from the burst of noise at the release, and the binding principle predicts that the opening of the glottis binds to the release, because it is the acoustic character of the burst which the glottal articulation is intended to modify. Observations of the timing of glottal abduction in single consonants and in clusters appear to indicate, however, that glottal articulations are not always timed with respect to oral ones in ways predicted by the binding principle - stops show coordination of glottal articulations with the closure as well as the release, and fricatives exhibit more restrictive timing of glottal articulations than expected:
1. Glottal abduction is coordinated with the closure rather than the release of the stop in voiceless unaspirated stops, and with the beginning of the constriction in voiceless fricatives;
2. As a corollary, the binding principle predicts that glottal articulations will bind to the stop in a cluster of a stop and a fricative (regardless of their order) and to the last stop in a cluster of stops, because typically only the last is audibly released. However, the manner of the first rather than the last obstruent in a cluster, i.e. whether it is a continuant, generally determines when the peak glottal opening will occur, despite that first segment's frequent lack of an audible release (it may lack a burst at its release either because it is a fricative or because it is a nonfinal stop). In some languages, furthermore, peak opening velocity occurs at the same time relative to the onset of the cluster articulation for clusters beginning with stops as well as fricatives, implying that the closure rather than the release is the general binding point for glottal articulations in consonants of both manners in clusters; and
3. Closure durations in aspirated stops vary inversely across languages with the duration of aspiration, implying that the timing of glottal abduction is relatively invariant, instead of shifting to stay close to the oral release.
The second and third problems are elaborated below; see Louis Goldstein's commentary for much more extensive discussion of aspects of the first two problems. Unsurprisingly, consideration of the speaker's acoustic goals resolves the first of the problems for the binding principle. Binding the peak opening in unaspirated stops to the onset of the oral closure yields a reliable contrast in how their bursts sound compared with those of aspirated stops. Since the glottis is adducted by the time the stop is released in unaspirated stops, abduction serves only to ensure an absence of vocal fold vibration during the closure and will not positively modify the acoustics of the burst and what follows. The weak burst of unaspirated stops may therefore be acoustically neutral, i.e. phonetically underspecified and lacking in any unique acoustic signature, in contrast to the positively modified burst of aspirated and some other kinds of stops (such as ejectives) in which the glottal articulation is coordinated with the release. Unaspirated stops cannot, of course, be treated as underspecified in articulation, since their successful production is
only possible if the abduction of the glottis is scheduled so as to be complete at or before the stop is released. If the burst is acoustically neutral in unaspirated stops, however, then the binding principle, or a restricted form of it, would not constrain when the glottal articulation occurs. This restricted form of the binding principle applies only to those stop types in which the timing of the glottal articulation is controlled so as to positively alter the acoustics of the burst. There is some further evidence that glottal timing is acoustically less crucial in unaspirated stops. Flege (1982) has shown that the point at which the glottis is adducted in the voiceless unaspirated realizations of utterance-initial [+voice] stops in English does not predict the onset of vocal fold vibration. Some speakers who occasionally or always produce voiceless unaspirated stops in this position always adduct the folds substantially before the stop release, but, nonetheless, there is no voicing until the stop is released. Others, who never adduct the folds until the stop release, always produce voiceless unaspirated allophones. This second pattern of synchronization of glottal adduction with the stop release is apparently the more typical one for voiceless unaspirated stops in other languages. (Voiceless unaspirated stops are typically the sole allophone of [-voice] stops in these languages.) In none of these languages is there any variation in the timing of adduction in their unaspirated stops. The differences in the phonological role of unaspirated stops and in the timing of the articulations which underlie them, among English speakers and between English and other languages, have no consequences for the acoustics of the burst, since the glottis is adducted in all of them at the stop release. The markedly different aerodynamic requirements of fricatives, especially sibilants, govern a different schedule of glottal and oral articulations than in stops. Both the early peak glottal opening and its large size in voiceless fricatives yield the high glottal air flow and hence high oral air flow needed to produce noisy turbulence. Noise also characterizes aspirated stops, but unlike fricatives this noise occurs late in the articulation of the segment rather than early.2 These aerodynamic requirements downstream thus demand a different, earlier timing of glottal abduction relative to the oral constriction in fricatives than in stops. Clusters of voiceless obstruents present a picture quite like that of single segments. Löfqvist and Yoshioka (1981a) compared the size and velocity of the glottal opening in single obstruents and obstruent clusters in which just a single opening occurred for three languages, Swedish, Japanese,3 and Icelandic. In all three, peak opening was earlier in fricatives than stops, and in clusters beginning with fricatives than stops. English is similar (Yoshioka, Löfqvist, and Hirose 1981). In all four languages, the manner of the initial obstruent in the cluster, i.e. whether it was a continuant or not, determines the timing of the cluster's glottal gesture. These timing patterns disconfirm the predictions of the binding principle in two ways. First, it is the first rather than the last member of the cluster which determines when peak glottal opening occurs and, second, stops occurring later in
the cluster do not attract abduction to their releases, away from a preceding fricative. Peak glottal opening in clusters of a fricative followed by a stop does not occur at the same time relative to the oral articulations as it would for either a fricative or a stop occurring alone. The most typical point is close to the boundary between the two oral articulations, a temporal compromise between the early peak of the fricative and the late peak of the stop. The first three languages also resemble one another in that the opening velocity of the glottis is consistently slower for sequences beginning with stops than with fricatives. Furthermore, peak velocity is reached at the same time relative to the preceding end of voicing for all sequences beginning with the same manner of consonant. The slower velocity of abduction in stops undoubtedly accounts for their later peak opening compared to fricatives. However, peak velocity does not necessarily occur at the same time relative to voicing offset for stop- as for fricative-initial sequences in all the languages. The velocity peak is reached at the same time for both stop- and fricative-initial clusters in Japanese and Swedish, but Swedish differs from Japanese in having a broader velocity peak in its stops than in its fricatives, while both stops and fricatives have a similar narrow velocity peak in Japanese. In Icelandic, on the other hand, the opening velocity peak is uniformly later for stops than fricatives. Lower opening velocity alone accounts for the later peak opening in stops than fricatives in Japanese, but in Swedish stops the sustained velocity peak leads to a larger as well as later abduction, and in Icelandic, the later peak velocity in stops compared to fricatives augments the effect of the stops' lower velocity of opening in delaying when peak glottal opening occurs. Despite the lack of agreement in the timing of glottal articulations among these languages, these data are a problem for the binding principle since for all three languages the velocity peak is controlled for stops as well as fricatives relative to the beginning of the consonant's articulation, not its release. These data suggest that the peak glottal opening in stops only appears to be coordinated with the release because it is later than in fricatives. The binding principle predicts that in obstruent clusters of mixed continuancy the stop rather than the fricative will determine when the peak glottal opening will occur because only the stop is released with a burst. That an initial fricative controls the timing of the glottal opening in fricative plus stop clusters is a real problem for the binding principle, despite the fact that a temporal compromise occurs between the two segments. This compromise is typically not sufficient, however, to prevent the neutralization of the aspiration contrast in the following stop. More generally, that the manner of the first rather than the last obstruent in a cluster determines when the peak glottal opening will occur clearly conflicts with the predictions of the binding principle. On the other hand, these differences between clusters beginning with fricatives and those beginning with stops may be an artifact. In all the data available on the timing of glottal articulations in obstruent clusters beginning with a fricative, that
fricative is a sibilant. Sibilants are the noisiest of fricatives and they also have strong peaks in their spectra. In their high intensity and clear timbre, sibilants grossly resemble vowels, and these two properties make them detectable and identifiable at the syllable periphery. Producing such intense, spectrally distinct noise demands, in addition to a sufficiently small oral aperture to produce turbulence (Stevens 1971; Shadle 1985), a large enough glottal aperture to produce a high glottal air flow. Sibilants, and more importantly clusters containing them as the first consonant, may therefore be exempt from the constraints imposed by the binding principle since their requirements for a high glottal air flow probably exceed those of stops. Browman and Goldstein (1986) also suggest that sibilant plus stop clusters are a special case, though in their view this is because sequential oral articulations are incorporated into the domain of a single opening of the glottis, rather than because of the demand for high glottal air flow to produce high oral air flow by the sibilant. In their view, furthermore, the timing of the glottal gesture determines that of the oral gestures, rather than the other way around as the binding principle would have it. Though all voiceless fricatives may require a large glottal aperture and be expected to incorporate a following stop, nonsibilant fricatives do not generally occur external to stops in syllables. Since only sibilants and not all fricatives, much less all continuants, are frequently incorporated with following stops within single glottal gestures, this coordinative structure, with the loss of an aspiration contrast in the stop it produces, may not be generalizable to other sequences of a continuant followed by a stop. Other kinds of clusters of voiceless obstruents may exhibit multiple openings of the glottis, with partial adduction between them, for each segment that requires a high glottal air flow to produce noise (Löfqvist and Yoshioka 1981a, b; Yoshioka, Löfqvist, and Hirose 1981, 1982). Multiple openings are controlled somewhat differently than single ones. Single openings in single segments and clusters are produced by the posterior cricoarytenoids and the interarytenoids contracting reciprocally in a classical antagonist pattern: contracting the posterior cricoarytenoids opens the glottis and then contracting the interarytenoids closes it; activity in the interarytenoids is suppressed when the posterior cricoarytenoids are active and vice versa. However, in clusters where the glottis opens more than once, activity is almost entirely suppressed in the interarytenoids throughout the cluster. The glottis either opens and closes with a waxing and waning of activity in the posterior cricoarytenoids alone or the small adductions come from slight increases in interarytenoid activity. These data are also problematic for the binding principle since the principle predicts the consolidation of all glottal gestures in a cluster into a single one, bound to the last stop in the cluster, rather than multiple openings. Consolidation in such clusters is evident only in the near total suppression of adductory activity in the interarytenoids. Turning now to the third problem, Hutters (1985) presents evidence from five
languages that the duration of aspiration varies inversely with the duration of the stop closure, apparently because the stop is released before peak glottal opening is reached in the languages with shorter closures, Danish and Hindi, but not until after peak glottal opening in the languages with longer closures, Swedish, English, and Icelandic. These data indicate that the glottal gesture is relatively invariant in its timing across these five languages, while the duration of the oral closure varies with respect to it. This suggestion acquires further support from Hutters's Danish data, in which the interval from the onset of the oral closure to peak glottal opening is nearly constant for stops differing in place of articulation, even though closure and aspiration durations vary inversely across places of articulation. The explanation for these differences cannot be found in the phonologies of these languages, since Icelandic and Danish employ aspiration in quite parallel ways to distinguish their two classes of stops from one another, as do Swedish and English. Browman and Goldstein (1986) claim that the most typical glottal gesture in a syllable onset is a single ballistic opening of the glottis, which implies that the duration of glottal gestures should not vary substantially across languages. Closure durations also differ systematically between aspirated and unaspirated stops, being longer in the latter than the former. The duration of closure in an unaspirated stop is in fact approximately equal to the closure plus aspiration interval of an aspirated stop (Hutters 1985; also Weismer 1980).4 However, the fact that closure durations vary across languages in aspirated stops with complementary variation in the duration of aspiration does not mean that glottal articulations are not bound to oral ones. So long as the peak glottal opening occurs at some constant interval from the oral release, then it can be said to be bound to it. The binding principle does not require that this interval be the same for all languages. Significant counterevidence would be a demonstration that closure duration could vary in a single language without proportional variation in the timing of the peak glottal opening. In this connection, we might appeal to a difference between macro-binding, the pattern of covariation in the timing of one autonomous articulation with respect to another, and micro-binding, a specification that the bound articulation occur at a constant interval from the event to which it binds, here, peak glottal opening with respect to the stop release (the distinction is Ohala's, p.c.).
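The micro-binding idea suggests one simple operational test, sketched below under the assumption that per-token times of the closure onset, the release, and the peak glottal opening are available (the data here are fabricated; no such regression is reported in this paper): regress the peak-to-release interval on closure duration. A slope near zero means the peak keeps a roughly constant distance from the release, as micro-binding to the release would require; a slope near one means it is instead timed from the closure onset.

# Sketch of a micro-binding test: does the interval from peak glottal opening
# to the release stay constant as closure duration varies?
import numpy as np

def micro_binding_slope(closure_onset, release, peak_opening):
    closure_dur = np.asarray(release) - np.asarray(closure_onset)
    peak_to_release = np.asarray(release) - np.asarray(peak_opening)
    slope, intercept = np.polyfit(closure_dur, peak_to_release, 1)
    return slope, intercept

# Fabricated tokens in which the peak is timed 50 ms after the closure onset:
rng = np.random.default_rng(2)
onset = np.zeros(50)
dur = rng.uniform(60, 140, 50)
peak = onset + 50 + rng.normal(0, 5, 50)
print(micro_binding_slope(onset, onset + dur, peak))
# Slope near 1: the peak tracks the onset, not the release; a slope near 0
# would instead indicate a constant peak-to-release interval (micro-binding).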
23.5 Conclusion
The binding principle claims glottal articulations will bind more tightly to oral ones in stops than in continuants and that a glottal articulation would be coordinated with the release of the oral articulation because in that way the release would be shaped acoustically by the glottal articulation and thus convey the nature of that glottal articulation. Support for the first part of the principle has not been presented in this paper, and it will be taken up elsewhere. 429
The second part of the principle was tested first against data from Icelandic and Hindi. One problem, the possibility of binding of glottal abduction to the onset of a following stop closure in the pre-aspirates in Icelandic was shown to be only an apparent problem. Since pre-aspiration is generally a phonetic property of heterosyllablic clusters ending in an underlying [ + spread] stop - through a shift of the glottal abduction to the first member of the cluster - while post-aspiration arises only when such stops occur singly or in tautosyllabic clusters, pre- and postaspiration may never contrast in the phonology of Icelandic. The apparent coordination of the glottal opening with the following oral closure was shown to be an artifact of variation in the amount of overlap between the oral closure and the interval when the glottis was completely abducted; no relational invariance between glottal abduction and oral closure was actually found. The absence of evidence of coordination between glottal abduction and the preceding closure in Icelandic post-aspirates remains unresolved. The data from Hindi more closely followed the predictions of the binding principle, but this suggested a difference in coordination schedules between the two languages rather than the contrast between stops of different types predicted by the binding principle. The unresolved problems may not expose weaknesses of the binding principle so much as they reflect difficulties in determining the tightness of binding. In particular, measuring covariation in duration may not accurately reveal how or what events are coordinated with one another. This would be especially likely if it were not the onset and offsets of articulations which are coordinated but rather something like their peak displacements. The evidence reviewed above suggests that it is peak opening of the glottis which exhibits invariance with respect to some part of the oral articulation it accompanies. Unless abduction were strictly ballistic, coordination of peak opening with the stop release would probably not be clearly observable in measurements of covariation in duration. Implicit in the traditional matrices of features employed until recently to represent the component articulations of segments is the claim that each articulatory gesture begins and ends simultaneously. The glottal and oral articulations of unaspirated stops and voiceless fricatives actually exhibit just this pattern of coordination, glottal abduction beginning at nearly the same time as the oral closure of constriction is complete and adduction being complete at the point when the closure or constriction is released. Furthermore, peak glottal opening is coordinated in unaspirated stops and fricatives with the onset of closure or constriction. Voiceless fricatives and aspirated stops contrast in the timing of glottal abduction, a difference which is perhaps best expressed in terms of contrasts in opening velocity, the fricatives being produced with much more rapid opening than the stops. Abduction in aspirated stops is, furthermore, coordinated with the release of the stop; it varies in when it occurs with respect to the onset of closure, but not with respect to the release. The explanatory power of the binding principle arises from the fact that the 430
contrast between aspirated stops on the one hand and voiceless fricatives or unaspirated stops on the other depends on timing differences. These timing differences are what is controlled in conveying these contrasts; their results are a markedly different acoustic quality between the burst and the brief interval after it in aspirated and unaspirated stops or between the early noise of fricatives compared to late noise in aspirated stops. The binding principle assures that these acoustic differences exist. Most troublesome in the long run for the binding principle is the fact that the timing of glottal abduction in voiceless obstruent clusters is determined by the manner of the initial rather than the final obstruent. The difficulty, of course, arises from the fact that that obstruent's release may not be acoustically signalled by a salient, transient event such as a burst. The more general problem for the binding principle posed by such data is that it suggests that glottal and oral articulations are scheduled with respect to the beginning of the articulatory unit rather than looking ahead to its end, despite the possibility that no audible release may occur in the middle of the cluster to convey the state of the glottis. The difficulty presented by these data is only reduced somewhat by the fact that in all the clusters beginning with fricatives, the fricative is a sibilant, which demands such a large glottal aperture that it may override the binding principle. This raises the question of how this principle applies in a grammar. Considering stops alone, the binding principle may be limited to single stops and not extend to clusters of stops. Rather than enforcing a consolidation of glottal gestures into a single gesture aligned with the audible release of a final stop in a cluster, the glottal gesture of each stop may bind to its release, whether or not it is audible. (Only when the cluster contains a word boundary are there separate openings; otherwise, it appears that only a single glottal abduction occurs.) If the possibility of an audible release, rather than its actual occurrence, is sufficient to determine when glottal abduction occurs, then the binding principle only determines the schedule of articulation within the domain of a single segment, even if that segment forms part of a larger articulatory unit. The schedule of glottal articulations remains what it would be if the segment occurred alone.5 Such clusters would differ from those where all the segments' glottal gestures are consolidated into a single one. This latter result closely resembles Browman and Goldstein's (1986) demonstration that the movement of the lower lip in bilabial articulations is the same for [p], [b], [m] and [mb], [mp], i.e. whether the accompanying soft palate articulation is single or double. The acoustic result which the binding principle is intended to assure, a burst positively modified by the glottal articulation bound to it, is apparently dispensable in clusters, perhaps because the glottal articulation is predictable within the most clusters in the languages investigated. (I had originally hoped that the binding principle would provide an account of this predictability.) Restricting the binding principle to single consonants shows that neither it nor other constraints are solely responsible
for determining when glottal articulations occur relative to oral ones in consonants. Coordination arises instead out of a combination of constraints.
Notes 1 What is essential here is whether the supraglottal articulation causes pressure to build up above the glottis. The binding principle therefore does not distinguish between nasal and nonnasal sonorants, i.e. between segments in which air is allowed to escape through the nose from those where it escapes through the mouth. 2 Voiceless fricatives can be divided (coarsely) into a frication portion followed by an aspiration portion. It is, however, uncertain whether the transition from frication to aspiration is a product of relaxation of the oral constriction, reducing oral air flow directly, or of adduction of the glottis, reducing oral airflowthrough reduction of glottal air flow. If the latter is the source of this asymmetry, then voiceless fricatives are the articulatory complement of aspirated stops, at least in the relative timing of their oral and glottal articulations. If voiced fricatives are similarly asymmetric, then they are the articulatory complement of voiced stops, in which typically the first part is voiced, but the later portion may exhibit devoicing. 3 The obstruent clusters examined in Japanese included strings derived by devoicing of an intervening vowel. Vowel devoicing is the only means by which "clusters" whose members differ in continuancy may be obtained in this language. These appeared to exhibit the same temporal coordination between oral and glottal articulations as underlying clusters, which in Japanese consist only of geminates, i.e. consonants with the same specification for continuancy (and place). 4 In English, the duration of constriction in voiceless fricatives is also approximately equal to the duration of closure in unaspirated stops (Weismer 1980), but both the timing of the gesture and its magnitude must be specified because there is an early and large peak in voiceless fricatives compared to the early but small peak in voiceless unaspirated stops and to the late and large peak in voiceless aspirated ones. The two kinds of stops still cannot be produced with the same abductory gesture since the peak opening is much larger in the aspirated than unaspirated stops.It is also not possible to specify velocity alone, since that would not predict the differences in the size of the opening between aspirated and unaspirated stops. 5 Strictly speaking, this is not true, since as the evidence of sibilant plus stop clusters shows the peak in the glottal abduction occurs at the boundary between the sibilant and stop articulations, a shift to a later point compared to the fricative occurring alone.
References Benguerel, A.-P. and T. Bhatia. 1980. Hindi stop consonants: An acoustic and fiberscopic study. Phonetica 37: 134-148. Bickley, C. 1982. Acoustic analysis and perception of breathy vowels. MIT. RLE. Speech Communication Group. Working Papers 1: 71-81. Browman, C. and L. Goldstein. 1985. Dynamic modeling of phonetic structure. In V. Fromkin (ed.) Phonetic Linguistics. Essays in Honor of Peter Ladefoged. New York: Academic Press, 35-53. 1986. Towards an articulatory phonology. Phonology Yearbook 3: 219-252. Chasaide, A. Ni. 1986. The perception of preaspirated stops. Presented at the 111th 432
Articulatory binding meeting of the Acoustical Society of America, Cleveland. Journal of the Acoustical Society of America 79: S7. Clements, G. N. 1985. The geometry of phonological features. Phonology Yearbook 2: 225-252. Clements, G. N. and K. C. Ford. 1979. Kikuyu tone shift and its synchronic consequences. Linguistic Inquiry 10: 179-210. Clements, G. N. and S. J. Keyser. 1983. CV Phonology: A Generative Theory of the Syllable. Cambridge: MIT Press. Clements, G. N. and E. Sezer. 1982. Vowel and consonant disharmony in Turkish. In H. Van der Hulst and N. Smith (eds.) The Structure of Phonological Representations. 2. Dordrecht: Foris, 213-255. Cohen, J. and P. Cohen. 1983. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum. Collier, R., L. Lisker, H. Hirose, and T. Ushijima. 1979. Voicing in intervocalic stops and fricatives in Dutch. Journal of Phonetics 7: 357-373. Flege, J. 1982. Laryngeal timing and the phonation onset in utterance-initial stops. Journal of Phonetics. 10: 177-192. Fowler, C. 1980. Coarticulation and theories of extrinsic timing. Journal of Phonetics 8: 113-133. Games, S. 1976. Quantity in Icelandic: Production and Perception. Hamburg: Helmut Buske. Goldsmith, J. 1976. Autosegmental phonology. Ph.D. dissertation, MIT. Hayes, B. 1986. Inalterability in CV phonology. Language 62: 321-351. Henke, W. 1966. Dynamic articulatory model of speech production using computer simulation. Ph.D. dissertation, MIT. Hirose, H. and T. Gay. 1972. The activity of the intrinsic laryngeal muscles in voicing control. Phonetica 25: 140-164. Hirose, H., L. Lisker, and A. Abramson. 1972. Physiological aspects of certain laryngeal features in speech production. Haskins Laboratories Status Report on Speech Research SR-31/32: 183-191. Hutters, B. 1985. Vocal fold adjustments in aspirated and unaspirated stops in Danish. Phonetica 42: 1-24. Kagaya, R. and H. Hirose. 1975. Fiberoptic, electromyographic, and acoustic analysis of Hindi stop consonants. University of Tokyo. Research Institute of Logopedics and Phoniatrics. Annual Bulletin 9: 27-46. Keating, P. 1985. CV phonology, experimental phonetics, and coarticulation. University of California, Los Angeles. Working Papers in Phonetics 62: 1-13. Kenstowicz, M. 1982. Gemination and spirantization in Tigrinya. Studies in the Linguistic Sciences 12. Urbana: University of Illinois, 103-122. Ladefoged, P. 1982. The linguistic use of different phonation types. University of California, Los Angeles. Working Papers in Phonetics 54: 28-39. Leben, W. 1978. The representation of tone. In V. Fromkin (ed.) Tone: A Linguistic Survey. New York: Academic Press, 177-219. Lofqvist, A. 1980. Interarticulator programming in stop production. Journal of Phonetics. 8: 475^90. Lofqvist, A. and H. Yoshioka. 1980. Laryngeal activity in Swedish obstruent clusters. Journal of the Acoustical Society of America 68: 792-801. 1981a. Interarticulator programming in obstruent production. Phonetica 38: 21-34. 1981b. Laryngeal activity in Icelandic obstruent production. Nordic Journal of Linguistic 4: 1-18. 433
McCarthy, J. 1986. OCP effects: gemination and antigemination. Linguistic Inquiry 17: 207-263. McCarthy, J. and A. Prince. 1986. Prosodic morphology. MS, University of Massachusetts, Amherst, and Brandeis University, Waltham. Munhall, K. 1985. An examination of intra-articulator timing. Journal of the Acoustical Society of America 78: 1548-1553. Ohala, J. and B. Lyberg. 1976. Comments on "Temporal interactions within a phrase and sentence context." Journal of the Acoustical Society of America 59: 990-992. Ohman, S. 1966. Coarticulation in VCV utterances: Spectrographic measurements. Journal of the Acoustical Society of America 39: 151-168. 1967. A numerical model of coarticulation. Journal of the Acoustical Society of America 41: 310-320. Petursson, M. 1972. La preaspiration in islandais moderne. Examen de sa realisation phonetique chex deux sujets. Studia Linguistica 26: 61-80. 1976. Aspiration et activite glottal. Examen experimental a partir de consonnes islandaises. Phonetica 33: 169-198. Sagey, E. 1986. The representation of features and relations in nonlinear phonology. Ph.D. dissertation, MIT. Schein, B. and D. Steriade. 1986. On geminates. Linguistic Inquiry 17: 691-744. Shadle, C. 1985. The acoustics of fricative consonants. MIT. RLE. Technical Report 506. Ph.D. dissertation, MIT. Shattuck-Hufnagel, S. and D. Klatt. 1979. The limited use of distinctive features and markedness in speech production: Evidence from speech error data. Journal of Verbal Learning and Verbal Behavior 18: 41-55. Steriade, D. 1982. Greek prosodies and the nature of syllabification. Ph.D. dissertation, MIT. Stevens, K. 1971. Air flow and turbulence noise for fricative and stop consonants: Static considerations. Journal of the Acoustical Society of America 50: 1180-1192. Thrainsson, H. 1978. On the phonology of Icelandic preaspiration. Nordic Journal of Linguistics 1: 3—54.
Trubetzkoy, N. 1939. Principles of Phonology. Trans. C. Baltaxe, 1969. Berkeley and Los Angeles: University of California Press. Tuller, B. and J. A. S. Kelso. 1984. The timing of articulatory gestures: Evidence for relational invariants. Journal of the Acoustical Society of America 76: 1030-1036. Weismer, G. 1980. Control of the voicing distinction for intervocalic stops and fricatives: Some data and theoretical considerations. Journal of Phonetics 8: 427^38. Yoshioka, H., A. Lofqvist, and H. Hirose. 1981. Laryngeal adjustments in the production of consonant clusters and geminates in American English. Journal of the Acoustical Society of America 70: 1615-1623. 1982. Laryngeal adjustments in Japanese voiceless sound production. Journal of Phonetics 10: 1-10.
24 The generality of articulatory binding: comments on Kingston's paper JOHN J. OHALA
24.1 Introduction
I think Kingston's "binding principle" is an extremely important concept for phonology which, once it is properly defined, will turn out to have application to a wider range of phonological facts than those he reviewed and will, moreover, constitute a major challenge to current phonological theory. In this comment I will attempt to (a) suggest a definition of binding which could survive the negative evidence Kingston found for it, (b) exemplify the wider range of phenomena which manifest binding, and (c) indicate how these phenomena challenge modern phonological practice and theory. I will also offer a brief digression on the history of the use of distinctive features in phonology.
24.2 The binding principle As I understand it, the "binding principle" recognizes that the vocal tract, though consisting of individual parts (glottis, tongue, velum, lips, etc.), is nevertheless a unified integrated "whole" which requires the orchestration of the parts to achieve certain ulterior goals. Intermediate goals include certain aerodynamic targets but the ultimate goals are acoustic—auditory events (cf. Jakobson, Fant, and Halle's [1951: 13] dictum, "we speak to be heard in order to be understood"). More specifically, the principle states that the vocal tract is subject to physical constraints that require temporal coordination between different, sometimes "distant" articulators. Kingston focuses on the case of stop bursts - a highly salient, highly informative acoustic event — which, to be produced correctly, require the cooperation of the glottis and the supraglottal articulators (including the velum). In Kingston (1985, and in the long version of the paper under discussion) he reviewed diachronic evidence that the glottal component of glottalized sonorants such as / m / , perhaps phonetically [?m] or [m?], migrates more freely onto adjacent segments tKan does that of glottalized stops, i.e. ejectives 435
such as [t']. The binding principle, however, can be invoked to explain many other interesting sound patterns, a sample of which I review below.
24.3 How to test for evidence of binding Kingston deserves praise not only for formulating this interesting and far-reaching principle but also for bravely putting it to the test. The results he reports were less than favorable to the explicit statement of the implications of the principle, namely, the existence of stronger positive correlations between glottal and supraglottal events at stop release in comparison to stop onset. However, I think he expected a tighter temporal binding between articulators than is strictly necessary to achieve a coordinated gesture such as a stop burst. Kingston himself suggests a more reasonable view (though he rejects it), namely that "simple overlap between the two articulators might accomplish [the required coordinated gesture]." I think this is the right approach: co-occurrence of states rather than correlation of gestures. What is important for a stop burst is the state of the glottis when the oral constriction is released, not, strictly speaking, when this state was achieved or when it was changed. Also, as I will suggest below, binding between articulators may be much more pervasive than Kingston suggests. Thus, it is not clear that just because binding must exist between articulators in order to produce a good burst, it therefore should not be manifested during other acoustic events, including stop onset.
24.4 A digression: correcting revisionist histories of phonology
Kingston implies that he is unaware of proposals earlier than that of Trubetzkoy (1939) to "break the segment into a bundle of distinctive features". In a similar vein, Halle (1985) remarks that in Jakobson's Remarques sur revolution phonologique du russe... (1929) one can find the important insight that phonemes are not the ultimate atoms of language, but that phonemes are themselves composed of readily identifiable properties (later called distinctive features) and that it is these properties that are directly involved in phonetic change. As a result, phonetic change affects whole classes of sounds rather than individual phonemes... This kind of generalization had no special status in pre-Jakobsonian linguistics, where phonemes are the ultimate constituent elements, and the fact that certain subsets of phonemes undergo common changes whereas others do not, remains without explanation. That speech sounds can be decomposed into features is evident in the work of Panini (Allen 1953) as well as the ancient Greeks and Romans (whence the traditional features "tenues" and "mediae"), not to mention a host of 436
grammarians, philologists, spelling reformers, inventors of shorthand systems, physiologists, physicists, and others even if their recognition of features in a few cases amounted to no more than giving a chart of speech sounds with the columns and rows labeled with terms "lingual," "guttural," "voiced," "nasal," and the like. Rask (1818) and Grimm (1822 [1967]), among others, employed such featural terms to express such generalizations as In the labial, lingual and guttural sounds, the Gothic... tenues correspond to the High German aspirates; the Gothic aspirates to the High German mediae. (Grimm p. 49) Furthermore, there are phonetic transcriptions which explicitly code features by means of diacritics or special "squiggles" on the invented letters, or even by fullblown separate symbols for each feature, among them Wilkins (1668), ten Kate (1723), Pitman (1840), Brucke (1863), Bell (1867), and Jespersen (1889). Jespersen, moreover, explicitly remarks that his feature notation "enables us to give comprehensive formulas of such general sound changes as have affected a whole series of sounds at once." (p. 81) He then goes on to give examples where the featural notation assists in the explanation of a specific sound change.
24.5 Further examples of binding
I now propose to give further examples of binding - as defined in modified form above - in support of the claim that it is quite pervasive in speech and has important consequences for phonological representation.
A. Soft palate elevation during oral obstruents
It is well known that oral - as opposed pharyngeal and glottal - obstruents require elevation of the soft palate (Ohala 1971a, 1975, 1983a; Schourup 1973) because a lowered soft palate would allow venting of air and thus reduce the pressure build up needed for an obstruent. Pharyngeal and glottal obstruents - and it should be kept in mind that from an aerodynamic point of view any voiced sonorant is a glottal obstruent since it obstructs the flow of air at the glottis - are not subject to this aerodynamic constraint because the build up of air pressure occurs "upstream" of the site where a lowered soft palate could cause venting. This binding of the states of the various air valves in the vocal tract has a variety of phonological consequences, among them the blocking perseveratory nasalization (triggered by a nasal consonant) by oral obstruents, but not glottal obstruents, in languages such as Sundanese (Robins 1957), as exemplified by words such [nahokYn], "to inform," and [ml?asih], "to love." Conversely the existence of a lowered soft palate prevents phonological processes which devoice and/or fricativize various segments (Ohala 1983a). For 437
example, in Fante /hi/ "border," is realized as [91], where /hi/, "where," becomes [hi] ( = [11]), not *[ci] (Schachter and Fromkin 1968). Here an open velic valve prevents the build up of air pressure which would create frication in the air channeled through the palatal constriction of the palatal allophone of / h / which appears before the palatal vowels /11/. B. Requirements for voicing dictate preferred duration of obstruent closure
Voicing requires air flow through the glottis; air flow in turn depends on the pressure drop across the glottis. Supraglottal obstruents, by their very nature, reduce the transglottal pressure drop. Ohala and Riordan (1979) presented evidence that stops could maintain sufficient air flow for voicing by means of passive expansion of the oral cavity for about 64 msec (see also Westbury 1979). Ohala (1983a) speculates that this is the cause of the widespread cross-language tendency for voiced obstruents to be shorter than voiceless obstruents (Westbury 1979: 98). I would further speculate that this relative shortness of voiced stops mandated by the glottal conditions required for voicing is at least partly the cause of the greater tendency for voiced stops to become spirants intervocalically vis-à-vis the voiceless stops, since the shorter the closure, the more likely it is to be incomplete, thus a spirant; cf. Spanish /radon/ [raðon] "radon" vs. /raton/ [raton] "mouse." Another consequence of this constraint is that the longer a stop closure is, the more susceptible it is to devoicing; this is presumably what is at work in the following morphophonemic alternations from More (Alexandre 1953): /bad + do/ > /bato/ "corbeilles," /lug + gu/ > /luku/ "enclos." Nubian (Bell 1971; Ohala and Riordan 1979) shows this same constraint and a corollary of it: stops articulated further back in the mouth are more susceptible to devoicing than those made further forward, since back-articulated ones have less opportunity for expansion of the volume behind the place of closure (see example 1). Nouns are inflected for conjunction by suffixing /-on/ and gemination of the stem-final consonant. Of interest here is what happens to final voiced stops when they are geminated.
(1)
Noun stem    Noun stem + "and"    Gloss
/fab/        /fab:on/             "father"
/seged/      /segeton/            "scorpion"
/kadʒ/       /katʃ:on/            "donkey"
/mog/        /mok:on/             "dog"
As is well known, this latter constraint, the difficulty of maintaining voicing during back-articulated stops, accounts for the widespread gaps in the velar and uvular positions among voiced stop series, e.g. in Dutch and Thai, Efik, Quiche (Chao 1936; Greenberg 1970; Gamkrelidze 1975; Javkin 1977; Ohala 1983a; Pinkerton 1986).
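Read off the forms in (1) above, the alternation amounts to a toy rule: the conjunctive suffix geminates the stem-final consonant, and a geminated voiced stop surfaces voiceless - except the labial, which remains voiced in these four forms, consistent with the claim that devoicing pressure grows as the closure is made further back. The sketch below is my restatement, not Bell's or Ohala's analysis; transcription is simplified, with doubled letters for length and J, C standing in for the affricates.

DEVOICE = {"d": "t", "J": "C", "g": "k"}  # /b/ deliberately absent: it stays voiced in (1)

def conjunctive(stem):
    last = stem[-1]
    geminate = DEVOICE.get(last, last) * 2  # geminate the final consonant, devoicing where the map says so
    return stem[:-1] + geminate + "on"

for stem in ("fab", "seged", "kaJ", "mog"):
    print(stem, "->", conjunctive(stem))
# fab -> fabbon, seged -> segetton, kaJ -> kaCCon, mog -> mokkon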
For a fuller treatment of these facts, including important constraints affecting voiced fricatives, see Ohala (1983a). Thus, not only is there a constraint binding the activity at the glottis with that in the supraglottal cavity, but a full account must incorporate the gradient influence of relative volume behind stop closures.
C. Acoustic consequences of multiple constrictions in the vocal tract
The resonances of the vocal tract are those acoustic frequencies whose standing waves optimally conform to the geometric constraints of the vocal tract; figure 24.1 shows the standing (velocity) waves for a uniform tract - i.e. one with approximately equal cross-dimensional area from glottis to lips - on the right in a schematic "straightened out" tract with the maxima in the standing wave indicated by arrows and on the left the standard sagittal section with the approximate locations of the same maxima indicated by filled circles. It is a simple matter to predict from this that the lowest three resonances of such a tract, Rl, R2, and R3 - equivalent for our purposes to Fl, F2, F3 - will have frequencies in the ratio 1:3:5, respectively, where Rn =
(2n - 1)c / (4 x vocal tract length)
(where c = the speed of sound) (Chiba and Kajiyama 1941; Fant 1960). The following rules of thumb (see Chiba and Kajiyama; Fant) give a qualitative prediction of what happens to the frequencies of these resonances when the tract does not have uniform cross-dimensional area, i.e. when it has a constriction somewhere: a) a constriction at a minimum in the velocity standing wave of a given resonance will elevate the frequency of that resonance, and b) a constriction at a maximum in the velocity standing wave of a given resonance will lower the frequency of that resonance. As these principles affect Rl, it is evident why the palatal and labial constrictions for [i] and [u], respectively, which are near the velocity maximum of Rl, create a low Fl for these vowels, while [a] which has a constriction in the pharyngeal region near the velocity minimum has a high Fl. However, of particular interest for the binding principle is the fact that two or more simultaneous constrictions, if placed strategically, can reinforce each others effect. Thus, it is
evident that simultaneous constrictions in the labial and velar-uvular region (the sites of R2's two velocity maxima) will lower F2 more than either constriction by itself. This is why, of all the possible simultaneous double constrictions (e.g., labial palatal, palatal pharyngeal), it is the labial and velar which are most often utilized 439
Figure 24.1 The standing (velocity) waves of the lowest three resonant frequencies (from top to bottom, R1, R2, and R3) for a uniform vocal tract (one with approximately equal cross-dimensional area from glottis to lips); on the right, in a schematic "straightened out" tract with the maxima in the standing wave indicated by arrows and on the left the standard sagittal section with the approximate locations of the same maxima indicated by filled circles.
in languages' segment inventories (for sounds such as [u w kʷ gʷ kp gb]; see also Ohala and Lorentz 1977; Ohala 1979a, b). I am assuming here that these are "good" or communicatively efficient sounds since by creating extreme modulations of a salient acoustic parameter (F2) they can be made to sound as different as possible from other sounds in the languages' segment inventories. An even more dramatic manifestation of binding is found in the case of the American English /r/ (= IPA [ɹ]), which, unlike most vowels or related glides found in languages of the world, has three constrictions, one at the lips, one in the mid-palatal region, and one in the pharynx (Uldall 1958; Delattre 1971). It is also acoustically unusual (and thus highly distinct) in that it has a remarkably low F3. We can see from figure 24.1 that these two facts are related: the three constrictions
are in precisely the locations - where there are velocity maxima in R3's standing wave - required to maximally lower the frequency of R3 (Ohala 1985). Another example of binding due to acoustic constraints (but explainable by principles other than those reviewed above) is the interaction between vowel quality and nasalization (Ohala 1971b, 1975, 1986; Wright 1986; Beddor, Krakow, and Goldstein 1986). Although the facts are complicated (and involve auditory effects in addition to acoustic ones) there is clear evidence, both phonetic and phonological, that nasalization distorts the F1 of vowels and thus influences their apparent height, leading to sound changes and morphological variation such as French [fin] "brandy," [fɛ̃] "refined, choice." There is also evidence that subglottal resonances can interact with the oral ones to give rise to acoustic effects that mimic nasalization and thus trick some listeners into producing the affected sounds with actual nasalization (Ohala 1975, 1980, 1983b; Ohala and Amador 1981). A final example - among many that could be cited - is the strong tendency for nasals assimilating to labial velar consonants ([w, kp, gb]) to be the velar [ŋ] rather than the labial [m] - due to the way oral cavity configuration affects the spectrum of the nasal murmur (Ohala and Lorentz 1977; Ohala 1979a, 1979b).
D. Conclusions from phonological data
I have presented evidence that the binding principle covers a wide range of phonological data that are motivated not only by aerodynamic but also by acoustic constraints. The goal of binding is the production of a distinct acoustic-auditory event, e.g. a stop burst, frication, aspiration. In a larger sense, all speech gestures, even a simple [m] or [1], require binding in order that different acoustic-auditory events result. The temporal binding of different articulators may not be as "tight" as it is in the case of burst production which, in contrast to [m] and similar sounds, is a very brief event, but binding exists nevertheless. Binding, therefore, should be viewed as a continuum from "tight" to "loose" and thus extended to cover brief events like bursts, somewhat longer events like [m] and [1], and even longer events like distinctive palatalization or pharyngealization which might require 150 ms or more for their implementation and perception. Such a classification of the distinctions used in speech has other phonological uses (Whitney 1874: 298; Stevens 1980; Ohala 1981). The short distinctions, e.g., stop, are acoustically robust and easily detected. They tend to be utilized first in the development of sound inventories in that languages with small inventories use these only. Distinctions with a longer time span are utilized only by languages with relatively large inventories which have already exhausted the short robust ones. The robust distinctions also appear to be less subject to the "long range" assimilations and dissimilations, that the longer, weaker distinctions often are. (There are complications in locating specific contrasts on this continuum, e.g. in the case of nasal consonants their abrupt amplitude and spectral discontinuities constitute a brief, robust cue whereas the nasalization they tend to induce on adjacent vowels
is relatively long in duration and perceptually less robust.) Conceivably, it is "cues" that should be classified according to this robust/nonrobust criterion, not the traditional notions of "feature" or "distinction," which we now know to be just cover terms for multiple cues; cf. Kohler (1984).
24.6 Conclusion
Kingston quite correctly offers the existence of binding as a challenge to current phonological, especially autosegmental, practice since purely formal mechanisms will be unable to account in any non-ad hoc way for the coordination that must occur between different articulators. Autosegmental representation - like the traditional linear representation it desires to replace - would treat as equal the phonological process that changes underlying voiced geminates to voiceless (as in More, above) - where supraglottal events dictate the state at the glottis - and the unnatural one in which underlying voiceless geminates become voiced. It would treat as equal processes that devoiced back-articulated stops but not more forwardarticulated ones (as in Nubian, above) and unexpected processes that did the reverse. The representation would not be able to differentiate between processes where nasalization affected vowel height as opposed, say, to vowel rounding. The problem pointed out by Chomsky and Halle (1968: 400) that phonological representations fail to represent the "intrinsic content" of speech evidently still plagues us. It will continue to plague us as long as phonologists try to answer their questions using compass and straightedge - or other tools mismatched to the task - instead of making some effort to study and incorporate into their representations the physical and psychological structure of speech. References Alexandre, R. P. 1953. La Langue More. Dakar: Memoires de l'lnstitut francais d'Afrique noire, no. 34. Allen, W. S. 1953. Phonetics in Ancient India. London: Oxford University Press. Beddor, P. S., R. A. Krakow, and L. M. Goldstein. 1986. Perceptual constraints and phonological change: A study of nasal vowel height. Phonology Yearbook 3: 197-217. Bell, A. M. 1867. Visible Speech. London: Simpkin, Marshall & Co. Bell, H. 1971. The phonology of Nobiin Nubian. African Language Review 9: 115-159. Briicke, E. W. v. 1863. Ueber eine neue Methode de phonetischen Transscription. In Sitzungsberichte der Philosophisch-Historischen Classe der Kaiserlichen Akademie der Wissenschaften Wien 41: 223-285. Chao, Y. R. 1936. Types of plosives in Chinese. In Proceedings of the 2nd International Congress of Phonetic Sciences. Cambridge: Cambridge University Press, 106-110. Chiba, T. and M. Kajiyama. 1941. The Vowel, Its Nature and Structure. Tokyo: Kaiseikan Publishing Co. Chomsky, N. and M. Halle. The Sound Pattern of English. New York: Harper and Row. 442
The generality of articulatory binding Delattre, P. 1971. Pharyngeal features in the consonants of Arabic, German, Spanish, French, and American English. Phonetica 23: 129-155. Fant, G. 1960. Acoustic Theory of Speech Production. The Hague: Mouton. Gamkrelidze, T. V. 1975. On the correlation of stops and fricatives in a phonological system. Lingua 35: 231-261. Greenberg, J. H. 1970. Some generalizations concerning glottalic consonants, especially implosives. International Journal of American Linguistics 36: 123-145. Grimm, J. 1822. Deutsche Grammatik. Vol. 1. 2nd ed. Gottingen: Dieterichschen Buchhandlung. [English translation in: Lehmann, W., ed. 1967. A Reader in Nineteenth Century Historical Indo-European Linguistics. Bloomington: Indiana University Press.] Halle, M. 1985. Remarks on the scientific revolution in linguistics 1926-1929. Studies in the Linguistic Sciences 15: 61-77. Jakobson, R. 1929. Remarques sur revolution phonologique du russe comparee a celle des autres langues slaves. Prague. Jakobson, R., G. Fant, and M. Halle. 1952. Preliminaries to speech analysis. The distinctive features and their correlates. Technical Report No. 13. Cambridge, MA: Acoustics Laboratory, MIT. Javkin, H. 1977. Towards a phonetic explanation for universal preferences in implosives and ejectives. Proceedings of the Annual Meeting of the Berkeley Linguistics Society 3: 559-565. Jespersen, O. 1889. Articulation of Speech Sounds, Represented by Means of Analphabetic Symbols. Marburg in Hesse: N. G. Elwert. Kingston, J. C. 1985. The phonetics and phonology of the timing of oral and glottal events. Ph.D. dissertation, University of California, Berkeley. Kohler, K. 1984. Phonetic explanation in phonology: The feature fortis/lenis. Phonetica 41: 150-174. Ohala, J. J. 1971a. Monitoring soft-palate movements in speech. Project on Linguistic Analysis Reports (Berkeley) 13: J01-J015. Ohala, J. J. 1971b. The role of physiological and acoustic models in explaining the direction of sound change. Project on Linguistic Analysis Reports (Berkeley) 15: 25-40. Ohala, J. J. 1975. Phonetic explanations for nasal sound patterns. In C. A. Ferguson, L. M. Hyman, and J. J. Ohala (eds.) Nasalfest: Papers from a Symposium on Nasals and Nasalization. Stanford: Language Universals Project, 289-316. Ohala, J. J. 1979a. The contribution of acoustic phonetics to phonology. In B. Lindblom and S. Ohman (eds.) Frontiers of Speech Communication Research. London: Academic Press, 355-363. Ohala, J. J. 1979b. Universals of labial velars and de Saussure's chess analogy. Proceedings of the 9th International Congress of Phonetic Sciences. Vol. 2. Copenhagen: Institute-of Phonetics, 41-47. Ohala, J. J. 1980. The application of phonological universals in speech pathology. In N. J. Lass (ed.) Speech and Language. Advances in Basic Research and Practice. Vol. 3. New York: Academic Press, 75-97. Ohala, J. J. 1981. The listener as a source of sound change. In C. S. Masek, R. A. Hendrick and M. F. Miller (eds.) Papers from the Parasession on Language and Behavior. Chicago: Chicago Linguistic Society, 178-203. Ohala, J. J. 1983a. The origin of sound patterns in vocal tract constraints. In P. F. MacNeilage (ed.) The Production of Speech. New York: Springer-Verlag, 189-216. Ohala, J. J. 1983b. The phonological end justifies any means. In S. Hattori and 443
K. Inoue (eds.) Proceedings of the 13th International Congress of Linguists, Tokyo, 29 Aug.-4 Sept. 1982, 232-243. [Distributed by Sanseido Shoten.] Ohala, J. J. 1985. kvonnd flat. In V. Fromkin (ed.) Phonetic Linguistics. Essays in Honor of Peter Ladefoged. Orlando, FL: Academic Press, 223-241. Ohala, J. J. 1986. Phonological evidence for top-down processing in speech perception. In J. S. Perkell and D. H. Klatt (eds.) Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence Erlbaum, 386-397. Ohala, J. J. and M. Amador. 1981. Spontaneous nasalization. Journal of the Acoustical Society of America 68: S54-S55. [Abstract] Ohala, J. J. and J. Lorentz. 1977. The story of [w]: an exercise in the phonetic explanation for sound patterns. Proceedings of the Annual Meeting of the Berkeley Linguistics Society 3: 577-599. Ohala, J. J. and Riordan, C. J. 1979. Passive vocal tract enlargement during voiced stops. In J. J. Wolf and D. H. Klatt (eds.) Speech Communication Papers. New York: Acoustical Society of America, 89-92. Pinkerton, S. 1986. Quichean (Mayan) glottalized and nonglottalized stops: A phonetic study with implications for phonological universals. In J. J. Ohala and J. J. Jaeger (eds.) Experimental Phonology. Orlando, FL: Academic Press, 125-139. Pitman, I. 1840. Phonography or Writing by Sound. London. Samuel Bagster & Sons. Rask, R. K. 1818. Undersegelse om det gamle Nordiske eller Islandske Sprogs Oprindelse. Copenhagen: Gyldendalske Boghandlings Forlag. Robins, R. H. 1957. Vowel nasality in Sundanese. In Studies in Linguistic Analysis. Oxford: Blackwell, 87-103. Schachter, P. and Fromkin, V. 1968. A phonology of Akan: Akuapem, Asante, and Fante. UCLA Working Papers in Phonetics 9. Schourup, L. C. 1973. A cross-language study of vowel nasalization. Ohio State University Working Papers in Linguistics 15: 190—221. Stevens, K. N. 1980. Discussion during Symposium on phonetic universals in phonological systems and their explanation. Proceedings of the 9th International Congress on Phonetic Sciences, Vol. 3. Copenhagen: Institute of Phonetics, 185-186. ten Kate, L. 1723. Aenleiding tot de Kennisse van het verhevene deel der Nederduitsche Sprake. Vol. I. Amsterdam: Rudolph en Gerard Wetstein. Trubetzkoy, N. S. 1939. Grundziige der Phonologie. Prague. Uldall, E. T. 1958. American 'molar' r and 'flapped' r. Revista do Laboratono de Fonetica Experimental (Coimbra) 4: 103-106. Westbury, J. R. 1979. Aspects of the temporal control of voicing in consonant clusters in English. Ph.D. dissertation, University of Texas, Austin. Whitney, W. D. 1874. Oriental and Linguistic Studies. Second series. The East and West; Religion and Mythology; Orthography and Phonology; Hindu Astronomy. New York: Scribner, Armstrong & Co. Wilkins, J. 1668. An Essay towards a Real Character and a Philosophical Language. London: Sa: Gellibrand, John Martin. Wright, J. T. 1986. The behaviour of nasalized vowels in the perceptual vowel space. In J. J. Ohala and J. J. Jaeger (eds.) Experimental Phonology. Orlando, FL: Academic Press, 45-67.
25 On articulatory binding: comments on Kingston's paper LOUIS GOLDSTEIN
Kingston's paper addresses a fundamental problem for the linguistic analysis of speech, namely the coordination among the basic articulatory events, or gestures, in the various speech subsystems. As discussed in Kingston and elsewhere (e.g. Browman and Goldstein 1986), principles of coordination are required once the constraint of simultaneity (implicit in the traditional view of the phonological segment) is abandoned, as it has been both on purely phonological grounds (e.g., Goldsmith 1976; Clements and Keyser 1983) and by virtue of the actual observed articulations (e.g. Bell-Berti and Harris 1981; Lofqvist 1980; Fujimura 1981). Kingston's contribution is to explore a principle that governs certain aspects of laryngeal-oral coordination. The attractiveness of the proposal is that it is wellmotivated in terms of the actual physics of speech - it is not an arbitrary stipulation. On the other hand, certain data suggest that the proposal, as stated, is not correct. The failure of this reasonable-looking principle may lead us in somewhat different directions in the search for principles of gestural organization. The simple version of Kingston's binding principle (section 23.2) predicts that the timing of glottal articulations with respect to oral constrictions will be more tightly constrained ("bound") the greater the degree of the oral constriction. The rationale here is that for stops, the glottal opening associated with voicelessness will have a "distal" effect on the characteristics of the stop release burst, as well as a "proximal" effect on the source characteristics. The characteristics of the stop release burst will be influenced by the intraoral pressure behind the stop closure. This pressure will, in turn, be influenced by the glottal state during closure; the glottis acts as a kind of "flow regulator"-the greater the glottal opening, the greater the intraoral pressure buildup can be. On the other hand, Kingston argues, for approximants there are no such "distal" aerodynamic consequences, and therefore their laryngeal articulations can be timed more loosely with respect to their oral constrictions. Fricatives are seen as intermediate, since for them, there is a distal function of glottal opening (it contributes to keeping oral flow rate high enough to produce turbulence). Kingston suggests that nevertheless, fricatives 445
"often pattern with sonorants rather than stops" in tightness of binding. This remark is presumably based on how stops and fricatives pattern phonologically: e.g. rules that "move" glottal gestures with respect to oral ones; such evidence is detailed in Kingston's earlier work (Kingston 1985). The binding principle leads Kingston to some specific predictions, two of which will be examined here. First, glottal articulations should be most tightly bound to oral stop gestures, somewhat less tightly bound to oral fricative gestures, and least tightly bound to approximant gestures. Second, glottal articulations should be bound to the release of stop gestures (since modifying the character of this burst is the distal function of the glottal gesture), but this additional constraint should not be true for fricatives, since the distal effect (turbulence production) is required throughout the length of the fricative. This leads Kingston to suggest that glottal articulations may bind to either fricative onsets or offsets. Different patterns of coordination of glottal opening gestures with stops and with fricatives have, in fact, been observed. Browman and Goldstein (1986:228) proposed a principle of glottal gesture coordination for word-initial stops and fricatives in English, repeated as (1) below: (1)
Glottal gesture coordination in English
a. If a fricative gesture is present, coordinate the peak glottal opening with the midpoint of the fricative.
b. Otherwise, coordinate the peak glottal opening with the release of the stop gesture.
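Read procedurally, (1) is a simple decision rule for locating the peak of the single word-initial glottal opening. The sketch below is only a paraphrase of that statement over invented data structures (gesture records with onset and release times in milliseconds); it is not code from Browman and Goldstein.

def peak_glottal_opening_time(oral_gestures):
    # oral_gestures: word-initial gestures in order, e.g. {"type": "fricative", "onset": 0, "release": 120}
    for g in oral_gestures:
        if g["type"] == "fricative":             # clause (a): align the peak with the fricative's midpoint
            return (g["onset"] + g["release"]) / 2.0
    return oral_gestures[0]["release"]           # clause (b): otherwise align it with the stop's release

# word-initial /sp/: fricative at 0-120 ms, stop closure at 120-190 ms
print(peak_glottal_opening_time([{"type": "fricative", "onset": 0, "release": 120},
                                 {"type": "stop", "onset": 120, "release": 190}]))  # 60.0
print(peak_glottal_opening_time([{"type": "stop", "onset": 0, "release": 70}]))     # 70

On this statement an initial /s/-stop cluster simply inherits the fricative's pattern of coordination, which is the sense in which, as discussed below, the fricative rather than the stop "wins out".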
The statement in (1) was based on the experimental results of Yoshioka, Lofqvist, and Hirose (1981) and Lofqvist (1984). Using photoelectric glottography with two speakers of American English, Lofqvist and Yoshioka (1984) found that the peak glottal opening is roughly simultaneous with the release of an oral (aspirated) stop, and that this timing remains relatively invariant as closure duration changes as a function of speaking rate and stress. For fricatives, on the other hand, the peak glottal opening occurred anywhere from 50 ms. to 110 ms. before the fricative release, and this interval varied linearly with the duration of the fricative constriction. In other words, the peak glottal opening was not tied to the fricative release. Likewise, it was not tightly tied to the fricative onset, as the time between fricative onset and peak glottal opening also varied linearly as a function of fricative duration (although the slope was less steep than for the comparable function at release). The effect of these two linear covariations is to keep the peak glottal opening more or less in the middle of the fricative gesture. These observed patterns are consistent with the second prediction of Kingston's binding principle, amended so that glottal gestures can bind to fricative midpoints (perhaps their point of maximum constriction) rather than either the onset or offset of the constriction. However, the first prediction of the binding principle, that stops should bind 446
glottal gestures more tightly than fricatives, is not borne out when stop and fricative gestures "compete" for the same glottal gesture. This situation arises for initial /s/-stop clusters which, in a number of languages (English: Yoshioka et al. 1981; Swedish; Lofqvist and Yoshioka 1980a; Icelandic: Lofqvist and Yoshioka 1980b; Petursson 1977; Danish: Fukui and Hirose, 1983) are produced with a single glottal opening gesture, comparable in magnitude to that found with a single / s / . As Browman and Goldstein (1986) point out, there is a phonological generalization in such languages, that words begin with at most a single glottal gesture. Since we have seen that glottal gestures are coordinated differently with fricative gestures than with stop gestures, we may ask which pattern of coordination wins out when a single glottal gesture is coordinated with a fricative — stop sequence. The binding principle predicts that the stop should win out, since it should bind the glottal gesture more tightly than does the fricative. However, the data (for example, Yoshioka et al. 1981) show that the fricatives tend to win out - the peak glottal opening occurs during the middle portion of the fricative, just as it does for / s / alone (although the peak may be somewhat delayed in the / s / stop case compared to / s / alone). This dominance of the fricative shows up in the "otherwise" condition in principle (lb) above. In English, as in the other languages listed above, the voicing contrast in stops is neutralized in initial /s/-stop clusters. It is possible (as argued by Browman and Goldstein 1986) to view such neutralization as a consequence of the generalization that words may begin with at most one glottal gesture and coordination principle (1). Since the peak of the single glottal opening occurs early in /s/-stop clusters, there is no way of contrasting voicing on the stop without producing a second glottal opening gesture, or a substantially larger opening gesture (which would have a longer duration). However, in defending the binding principle, one might turn this around and argue that because this is an environment of neutralization, the stop's grasp on the glottal gesture is weakened, and therefore the fricative may dominate. However, recent data reported by Munhall and Lofqvist (forthcoming) case doubt on this latter interpretation. They examined laryngeal behaviour in the phrase "kiss Ted," spoken at different rates. They found that at relatively slow speech rates the /s#t/ is produced with two glottal openings, one during the fricative, and one whose peak is at the stop release. When the speaking rate becomes sufficiently fast, however, the two glottal openings merge into one whose peak is during the fricative (although considerably later in the fricative than in the slow speech conditions). Thus, again in this situation, the fricative wins out over the stop, and in this case there is no recourse to neutralization as an explanation for the phenomenon. There are also data that suggest that the second prediction of the binding principle - that glottal gestures should bind to stop releases - is also incorrect. This evidence is provided by the pattern of laryngeal-oral coordination found in voiceless unaspirated stops. Unlike aspirated stops for which peak glottal opening 447
coincides with stop release, peak glottal opening in unaspirated stops occurs during oral closure, and the folds are once again approximated by the time of stop release (e.g. Hindi: Kagaya and Hirose 1975; Dixit 1987; French: Benguerel, Hirose, Sawashima and Ushijima 1978). While it is possible that these early peaks are still coordinated with stop release (as predicted by the binding principle), even though they are not simultaneous with it, the results of Lofqvist (1980) argue against this interpretation, at least for Swedish. He found, rather, that glottal opening for voiceless unaspirated stops is coordinated with respect to stop closure: peak glottal opening occurs a fixed interval after stop closure onset (about 60 ms.), even as the duration of stop closure varied considerably. Again, this result is contrary to the predictions of the binding principle. The failure of the binding principle to account for laryngeal-oral coordination in voiceless unaspirated stops leads us to examine the original rationale for the principle in more detail. Kingston hypothesizes that the distal goal of glottal opening is to help contribute to a high intraoral pressure at the moment of release, which will, in turn, result in a proper explosive burst being produced. However, given that oral pressure buildup can be quite rapid (particularly in the absence of any oral expansion gestures designed to keep intraoral pressure low, cf. Westbury 1983), it is not clear that tight coordination of peak glottal opening with stop release is required to achieve the intraoral pressure characteristic of voiceless stop releases - short periods of glottal opening at the beginning of stop closure may be sufficient for the goal to be met. Pressure buildup may "saturate" quickly. Indeed, Dixit and Brown (1978) have found that peak intraoral pressure in Hindi voiceless unaspirated stops is as high as that found in voiceless aspirated stops. (This is true even for medial voiceless unaspirated stops which, for at least some Hindi speakers, are produced with no observable glottal opening at all.) Thus, glottal opening may contribute to the characteristics of the release as long as it occurs some time during the stop occlusion; the glottal opening does not have to be tightly coordinated with the release to achieve this effect. This view contrasts with Kingston's resolution of this problem (section 23.4), in which he hypothesizes that the burst may be acoustically "weak" for voiceless unaspirated stops. On the basis of data such as that of Dixit and Brown, however, there is no reason to think that their bursts, will be weak. From this point of view, (and this seems to be the direction Kingston is taking in section 23.4), the importance of tight coordination between openings and stop gestures is not to be understood in terms of ensuring adequate pressure for a substantial burst. Rather, the different timing patterns associated with aspirated and unaspirated voiceless stops will result in distinctive acoustic events around the time of the burst (e.g. aspiration or the lack of it). This leads to a somewhat different explanation for Kingston's (1985) phonological examples showing greater "mobility" for glottal gestures associated with continuants (particularly approximants) than with stops. It is not the case that stops demand coordination of glottal 448
On articulatory binding events with their releases, but it is the case that coordinating peak glottal opening with different phases of a stop produces very different acoustic consequences (thereby allowing stops to show the variety of voicing/aspiration contrasts that languages show). On the other hand, coordinating glottal gestures (opening or constriction) with different phases of an approximant (for example, /I/) would be expected to produce fairly similar acoustic consequences — the only difference between them would be one of temporal order, per se. It is more likely, therefore, that such approximant patterns could be confused with one another by listeners than it would be for the comparable stop patterns. This greater confusability could lead, through sound change, to the kind of alternations Kingston describes.
Note This work was supported by NIH grants HD-01994, NS-13870, NS-13617m and NSF grant BNS B520709 to Haskins Laboratories. References Bell-Berti, F. and K. S. Harris. 1981. A temporal model of speech production. Phonetica 38: 9-20. Benguerel, A. P., H. Hirose, M. Sawashima, and T. Ushijima. 1978. Laryngeal control in French stop production: A fiberscopic, acoustic and electromyographic study. Folia Phoniatrica 30: 175-198. Browman, C. P. and L. Goldstein. 1986. Towards an articulatory phonology. Phonology Yearbook 3: 219-252. Clements, G. N. and S. J. Keyser. 1983. CV Phonology: A Generative Theory of the Syllable. Cambridge, MA: MIT Press. Dixit, R. P. and W. S. Brown. 1978. Peak magnitudes of supraglottal air pressure associated with affricated and nonaffricated stop consonant productions in Hindi. Journal of Phonetics 6: 353-365. Dixit, R. P. 1987. Mechanisms for voicing and aspiration: Hindi and other languages compared. UCLA Working Papers in Phonetics 67: 49-102. Fujimura, O. 1981. Elementary gestures and temporal organization - What does an articulatory constraint mean? In T. Myers, J. Laver, and J. Anderson (eds.), The Cognitive Representation of Speech. Amsterdam: North-Holland, 101-110. Fukui, N. and H. Hirose. 1983. Laryngeal adjustments in Danish voiceless obstruent production. Annual Bulletin Research Institute of Logopedics and Phoniatrics 17: 61-71. Goldsmith, J. A. 1976. Autosegmental Phonology. Bloomington, IN: Indiana University Linguistics Club. Kagaya, R. and H. Hirose. 1975. Fiberoptic, electromyographic and acoustic analyses of Hindi stop consonants. Annual Bulletin. Research Institute of Logopedics and Phoniatrics 9: 27^6. Kingston, J. 1985. The phonetics and phonology of the timing of oral and glottal events. Ph.D. dissertation, University of California, Berkeley. Lofqvist, A. 1980. Interarticulator programming in stop production. Journal of Phonetics 8:475^90. 449
Lofqvist, A. and H. Yoshioka. 1980a. Laryngeal activity in Swedish obstruent clusters. Journal of the Acoustical Society of America 68: 792-801.
Lofqvist, A. and H. Yoshioka. 1980b. Laryngeal activity in Icelandic obstruent production. Haskins Laboratories. Status Report on Speech Research SR-63/64: 272-292.
Lofqvist, A. and H. Yoshioka. 1984. Intrasegmental timing: Laryngeal-oral coordination in voiceless consonant production. Speech Communication 3: 279-289.
Munhall, K. and A. Lofqvist. (forthcoming). Gestural aggregation in sequential segments. PAW Review, University of Connecticut.
Petursson, M. 1977. Timing of glottal events in the production of aspiration after [s]. Journal of Phonetics 5: 205-212.
Westbury, J. R. 1983. Enlargement of the supraglottal cavity and its relation to stop consonant voicing. Journal of the Acoustical Society of America 73: 1322-1326.
Yoshioka, H., A. Lofqvist, and H. Hirose. 1981. Laryngeal adjustments in the production of consonant clusters and geminates in American English. Journal of the Acoustical Society of America 70: 1615-1623.
26 The window model of coarticulation: articulatory evidence PATRICIA A. KEATING
26.1 Introduction
26.1.1 Phonetics and phonology
Much recent work in phonetics aims to provide rules, in the framework of generative phonology, that will characterize aspects of speech previously thought to be outside the province of grammatical theory. These phonetic rules operate on a symbolic representation from the phonology to derive a physical representation which, like speech, exists in continuous time and space. The precise nature of phonological representations depends on the theory of phonology, but certain general distinctions between phonological and phonetic representations can be expected. Only in the phonology are there discrete and timeless segments characterized by static binary features, though phonological representations are not limited to such segments. Even in the phonology, segments may become less discrete as features spread from one segment to another, and less categorical if features assume non-binary values. However, only in the phonetics are temporal structure made explicit and features interpreted along physical dimensions; the relations between phonological features and physical dimensions may be somewhat complex. Thus phonological representations involve two idealizations. They idealize in time with segmentation, by positing individual segments which have no duration or internal temporal structure. Temporal information is limited to the linear order of segments and their component features. Phonological representations idealize in space with labeling, by categorizing segments according to the physically abstract features. These idealizations are motivated by the many phonological generalizations that make no reference to quantitative properties of segments, but do make reference to categorial properties. Such generalizations are best stated on representations without the quantitative information, from which more specific and detailed representations can then be derived.
Because phonological and phonetic representations are different, the rules that can operate on each must be different. Phonological representations, which are essentially static and categorial, can be acted on by phonological rules, which change or rearrange the features which comprise segments. The output phonological representations can be acted on by phonetic rules, which interpret these features in time and space. Phonetic rules can thus, for example, assign a segment only a slight amount of some property, or assign an amount that changes over time during the segment. The result is a representation which provides continuous time functions along articulatory or acoustic dimensions. In this paper, the representations discussed will be articulatory; they depict articulatory movements in space as a function of time, simply because of the interest of some available articulatory data. No special status for articulatory representations is intended. In considering the relation between segments and the speech signal, phoneticians have always seen coarticulation as a key phenomenon to be explained. Coarticulation refers to articulatory overlap between neighboring segments, which results in segments generally appearing assimilated to their contexts. In the spatial domain, transitions must be made from one segment to the next, so that there are no clear boundaries between segments. Whatever the feature values of adjacent segments, some relatively smooth spatial trajectory between their corresponding physical values must be provided. What mechanisms are available to provide such enriched representations? A new account of the spatial aspect of continuous representations is proposed here. In this account, continuous spatial representations are derived using information about the contextual variability, or coarticulation, of each segment. Though our concern here will be with coarticulation that is phonetic, i.e. quantitative, in character, it must be noted that rules of coarticulation, like other rules, could be either phonological or phonetic in the sense described above. A consequence of work in autosegmental and CV phonology (e.g. Goldsmith 1976; Steriade 1982) is that some segmental overlap can now be represented phonologically. Phonological rules of feature-spreading will produce partial or complete overlapping of segments, including assimilation over long spans. Phonological rules nonetheless have only limited access to segment-internal events. Phonetic rules, on the other hand, can (and more typically will) affect portions of segments, or affect them only slightly, or cause them to vary continuously in quality. The distinction between phonological and phonetic coarticulation is brought out in, for example, discussions of Arabic tongue backing ("emphasis") by Ghazeli (1977) and Card (1979). In previous work, it was largely assumed that this phenomenon was phonological in nature, because the effects of backing could extend over a span of several segments. The phenomenon then appears to be a
prime candidate for an autosegmental account in which one or more feature values (i.e. [+back]; possibly also [+low]) spread from certain contrasting segments to other segments in a word. However, Ghazeli and Card, in studies of different dialects and different types of data, both find difficulties with a segmental feature analysis, traditional or autosegmental. In these studies, facts about the gradient nature of contextual tongue backing are presented. The phenomena discussed include partial backing of front segments by back segments and vice versa; weakening (as opposed to blocking) of the spread of backness by front segments; dependence of the amount of backness on distance from the trigger to the target segment. Clearly, categorical phonological rules cannot describe such effects. The difficulties are discussed explicitly by Card. For example, she notes that underlyingly backed segments are "more emphatic" than derived segments are, apparently requiring that phonological and phonetic levels of representation be kept distinct in output. However, neither Card nor Ghazeli actually provides a phonetic analysis of any of these phenomena. Much of the coarticulation literature is confusing on this issue of levels, in that phenomena that are clearly phonetic are often given (unsatisfactory) phonological treatments. In the late 1960s and early 1970s, studies of coarticulation were extended to include effects over relatively long spans. These effects were modeled in terms of spreading of binary feature values; analyses of phonetic nasalization and lip rounding proposed by Moll and Daniloff (1971) and Benguerel and Cowan (1974), for example, were completely phonological in character. Not surprisingly, binary spreading analyses generally proved inadequate (see, for example, Kent and Minifie 1977). The data that were being analyzed were continuous physical records, and the analyses were intended to account for such things as details of timing. The analyses failed because such phonological accounts, which make no reference to time beyond linear sequencing, in principle cannot refer to particular moments during segments. The point, though, is not to make the opposite category error by assuming that all coarticulation and assimilation must be phonetic in character. Rather, the point is to determine the nature of each case. What we want, then, is a way of describing those coarticulatory effects which do not involve phonological manipulation of segmental feature values, but instead involve quantitative interactions in continuous time and space. To simplify matters, we will consider only one sub-type of coarticulation: coarticulation involving a single articulator used for successive segments. Coarticulation involving the coordination of two different articulators will not be considered, as further principles are then required for inter-articulator alignment. In single articulator coarticulation, the given articulator must accommodate the spatial requirements of successive segments. If two such requirements are in conflict, they could be moved apart in time (temporal variation), or one of them could be modified (spatial variation). The question, then, is how phonetic rules deal with such situations.
26.1.2 Target models
The traditional, and still common, view of what phonetic rules do is that segmental features are converted into spatio-temporal targets (e.g. MacNeilage 1970), which are then connected up. Segmental speech synthesis by rule typically uses some kind of targets-and-connections model. Targets were formerly seen as invariant, the defining characteristic of a phoneme class. In the process of connecting, target values may not always be reached; e.g. targets may be undershot or overshot due to constraints on speed of movement, thus resulting in surface allophonic variation. The approach in Pierrehumbert (1980) and related work on intonation is in a similar vein, though with an important difference. In this work, target F0 values are assigned in time and space by a context-sensitive process called "evaluation." A tone is evaluated with reference to various factors, such as the speaker's current overall pitch range, the phonological identity of the previous tone, the phonetic value assigned to the previous tone, and the particular tonal configurations involved. The use of context-sensitive evaluation, instead of invariant targets, minimizes the need for processes of undershoot and overshoot to deal with systematic deviation of observed contours from targets. While Pierrehumbert believes that crowded tones can give rise to overshoot and undershoot, in cases where tones are sparse their targets are always reached. Targets are connected by rules of "interpolation," which build contours. Interpolation functions are usually monotonic, with target values usually providing the turning points in a contour, and in general the intention is a theory in which speech production constrains interpolations. However, when tones are sparse, interpolations may vary; for example, in the 1980 work on English, "sagging" and "spreading" functions are used to sharpen and highlight F0 peaks. A targets-and-connections account of phonetic implementation is only part of a complete phonological and phonetic system. Such a system allows several types of phonological or phonetic contextual influences. Indeed, the system may well be so rich that it is difficult to determine the nature of any single observed effect. First there are the phonological rules which affect (i.e. change, insert, delete) binary feature values; these rules ultimately give rise to gross spatial changes when the feature values are interpreted quantitatively by later rules. Next there is the possibility of context-sensitive evaluation; in Pierrehumbert's scheme for intonation, all evaluation is context-sensitive. An example with segmental features would be that the precise spatial place of articulation of one segment could depend on that of an adjacent segment. This situation differs from a phonological feature change in that the spatial shift would presumably be small. Such context-sensitive "target selection" was used, for example, by Ohman (1967) to account for variation in velar consonants as a function of details of the vowel context; similar rules are often used in speech synthesis (e.g. Allen, Hunnicutt, and Klatt 1987).
Another locus of contextual effects is the temporal location of targets. Because spatial values must be assigned to particular points in time, contextual effects could arise from shifts in such time points, rather than the spatial values themselves. For example, a value for one segment could be assigned to a relatively early point in time, far from the value of the following segment. Subtle variation in the timing of targets will produce subtle phonetic effects, e.g. on-glides and offglides to vowels due to consonants. Interpolation between targets results in time-varying context effects. Pierrehumbert (1980) showed how this mechanism could be used to determine much of an intonation contour. When two points are connected up, both of them influence the entire transition between them. This mechanism becomes especially important when the targets to be connected up are located far apart in time. Since English tones, from which intonation contours can be generated, are sparse relative to the syllables of an utterance, the parts of the contour interpolated between the phonetic values of tones play an important role. The same would be true for segmental features if segments may be underspecified throughout the phonetics (Keating 1985). Ohman (1966) used this mechanism, interpolation between sparse values, to produce tongue body coarticulatory effects on consonants, and Fowler (1980) draws on Ohman's model in her own account of vowel and consonant coproduction. In this paper I propose a somewhat different way of viewing the process of building a contour between segmental features. In this new model, variability, both systematic and random, plays a more central role, while targets, and turning points in contours, play a much lessened role.
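For concreteness, and purely as an illustration of the targets-and-connections idea rather than a reconstruction of Pierrehumbert's or Ohman's actual rules, the mechanism just described can be sketched as follows: each specified segment or tone is assigned a single (time, value) target, and the rest of the contour is filled in by interpolation, so that unspecified stretches take their values entirely from their neighbors. The times, values, and choice of linear interpolation in this sketch are all assumptions made for the example.

```python
import numpy as np

def target_contour(targets, n_samples=100):
    """Build a contour from sparse (time, value) targets by interpolation.

    targets: list of (time, value) pairs for the specified segments or tones,
        in arbitrary time and parameter units. Stretches between targets
        simply inherit the interpolated values (here, linear interpolation).
    """
    times, values = zip(*sorted(targets))
    grid = np.linspace(min(times), max(times), n_samples)
    return grid, np.interp(grid, times, values)

# Three sparse targets; everything between them comes from the interpolation.
t, contour = target_contour([(0.0, 120.0), (0.6, 180.0), (1.0, 110.0)])
```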
26.2 The window model of contour construction
26.2.1 Outline of model
I propose that for a given physical articulatory dimension, such as jaw position or tongue backness, each feature value of a segment has associated with it a range of possible spatial values, i.e. a minimum and maximum value that the observed values must fall within. I will call this range of values a window. As will be seen below, this window is not a mean value with a range around that mean, or any other representation of a basic value and variation around that value. It is an undifferentiated range representing the contextual variability of a feature value. For some segments this window is very narrow, reflecting little contextual variation; for others it is very wide, reflecting extreme contextual variation. Window width thus gives a metric of variability. There is no other "target" associated with a segment; the target is no more than this entire contextual range. To determine the window for a segment or for a particular feature value, quantitative values are collected across different contexts. Since an overall range of values is sought, maximum and minimum values are the most important.
Therefore contexts which provide extreme values are crucial, and must be found for each segment or class. A window determined in this way is then used to characterize the overall contextual variability of a segment. Windows are determined empirically on the basis of context, but once determined are not themselves contextually varied. That is, a feature value or segment class has one and only one window that characterizes all contexts taken together; it does not have different windows for different contexts. Information about the possibilities for contextual variation is already built into that one window. Note, however, that the phonological feature values that are the basis for window selection need not be the same as the underlying values: phonological rules to change or spread feature values still apply before the phonetics. Thus, in terms of segments, windows are selected for extrinsic allophones rather than phonemes. Windows are given for physical dimensions rather than for phonological features.1 In some cases, the relation between a feature and a physical dimension is fairly direct; the relation between [nasal] and velum position is a standard example. In other cases, the relation may be less direct. Dimensions of tongue and jaw position relate to more than one phonological feature. Thus, phonetic implementation involves interpreting features as physical dimensions in a potentially complex way, though conceivably with the right set of physical dimensions and features this task would be more straightforward than it now seems. Furthermore, the physical interpretation may depend to some extent on other feature values for a given segment. For example, place of articulation may depend somewhat on manner. Thus, in what follows, attributing one window to all instances of a phonological feature value is probably an oversimplification. On a given dimension, then, a sequence of segments' feature values can be translated into a sequence of windows. The process of interpolation consists of finding a path through these windows. Although the relevant modeling remains to be carried out, I assume that this path is constrained by requirements of contour continuity and smoothness, and of minimal articulatory effort, along the lines of minimal displacements or minimal peak velocities. Thus the process of interpolation can be viewed as an optimization procedure which finds smooth functions that fall within the windows. Most of the path must fit into the window, but some part of it will fall within narrow "transition" zones between windows; in the case of adjacent narrow windows, the entire transition will take place quickly between the windows. On this view, the job of "evaluation" (for example, determining turning points in curves) is divided between a mechanism which provides the windows, and the interpolation mechanism. The individual values associated with segments do not exist before an actual curve is built; there are no "targets" or assigned values. Thus whether there is a turning point associated with a given segment depends on the window for that segment and the windows of the context. Windows are ranges within which values forming a path are allowed to fall.
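One way to make this optimization view concrete (my formulation, not Keating's) is to sample the path at discrete times t = 1, ..., T, let s(t) be the segment governing sample t, and minimize a smoothness or effort cost subject to the window constraints, for example

\[
\min_{x_1,\dots,x_T}\ \sum_{t=2}^{T-1}\bigl(x_{t+1}-2x_t+x_{t-1}\bigr)^2
\qquad\text{subject to}\qquad
\mathrm{lo}_{s(t)} \le x_t \le \mathrm{hi}_{s(t)},\quad t=1,\dots,T,
\]

where [lo_s, hi_s] is the window for segment s on the dimension in question. Squared second differences stand in for minimal peak accelerations; a velocity-based cost using first differences would be an equally plausible choice, and the assignment of samples to segments already presupposes an answer to the timing questions left open below.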
Figure 26.1 Illustrations of sequences of windows of various widths. See text for description of each sequence.
Depending on the particular context, a path through a segment might pass through the entire range of values in the window, or span only a more limited range within the window. The paths depend on the context. This is why the window is not taken to be a mean plus a range around the mean. It is not clear how information about a mean value (across all contexts) could be useful in constructing a path for any one context; one could not, for example, constrain a path to pass through the mean, or show the mean as a turning point. In this paper I will offer no explicit procedure for constructing paths through windows, e.g. what functions are possible, whether construction proceeds directionally, how large a span is dealt with at a time. I also leave aside the question of how timing fits into this scheme, e.g. the time interval over which paths are constructed, and whether windows have variable durations, or are purely notional. These are, of course, crucial points in actually implementing the model, but the guiding ideas should be clear.
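Since no explicit procedure is offered here, the following is a minimal sketch of one possible implementation, not a claim about how such paths are actually constructed. It gives every window the same notional duration (sidestepping the timing questions just raised) and finds a path by minimizing summed squared second differences under the box constraints that the windows supply, using a bound-constrained optimizer; the function name and all numerical values are invented for the example.

```python
import numpy as np
from scipy.optimize import minimize

def window_path(windows, samples_per_window=10):
    """Trace a smooth path through a sequence of (lo, hi) windows.

    windows: one (lo, hi) pair per segment on a single articulatory
        dimension, in arbitrary units. Each window is given the same
        notional duration of samples_per_window time samples.
    Smoothness is enforced by minimizing summed squared second differences,
    a stand-in for the "minimal articulatory effort" requirement.
    """
    lo = np.repeat([w[0] for w in windows], samples_per_window)
    hi = np.repeat([w[1] for w in windows], samples_per_window)
    x0 = (lo + hi) / 2.0                    # start at the window midpoints

    def cost(x):
        return np.sum(np.diff(x, 2) ** 2)   # squared second differences

    res = minimize(cost, x0, method="L-BFGS-B", bounds=list(zip(lo, hi)))
    return res.x

# A wide window between two identical narrow windows (cf. figure 26.1):
# the wide-window segment contributes no turning point of its own.
path = window_path([(7.0, 8.0), (0.0, 10.0), (7.0, 8.0)])
```

Because windows rather than point targets constrain the optimizer, a sequence whose windows underdetermine the path admits many solutions of equal cost; this is the multiple-interpolation possibility discussed immediately below.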
To show how the model is to work, various combinations of wide and narrow windows, and schematic interpolations through them, are given in figure 26.1. The contours were drawn by eye and hand. In this figure, imagine a single articulatory dimension, with each example showing a range of values that spans some subset of the total possible physical range for this articulator. First, in (a), consider cases of a segment with a narrow window between two identical segments different from the middle segment. This middle segment imposes strong constraints on the interpolation, shows little variation across contexts, and affects the interpolation in the adjacent segments. Next, in (b), consider cases of a segment with a wide window between the same identical segments. The wide-window segment assimilates its turning point to the context, and in some contexts will show no turning point at all. Finally, in (c) and (d), consider wide and narrow windows between unlike segments. The wide window allows straight interpolations between many different segment types; the narrow window more often makes its own contribution to the curve. Note that the contours shown are not the only possible interpolations through these sequences, since even minimal curves can be moved higher or lower in sequences of wider windows. When only wide windows are found in a sequence, or when a segment is in isolation, multiple interpolations will also be possible. Indeed, the prediction of this model is that speakers, and repetitions of a single speaker, should vary in their trajectories through window sequences which underdetermine the interpolation. Whether such variability is in fact found remains to be seen. It may be that instead there are speaker-specific strategies for limiting variation in such cases, requiring the model to be revised. For example, windows as currently viewed are essentially flat distributions of observed values; instead, empirical distributions could be associated with windows, and used to calculate preferred paths.
26.2.2 An example: English velum position
An example will illustrate how this model is derived from and applied to data. A case from the literature that receives a natural analysis under this model involves velum height in English. It is well-known that the degree of velum opening for [+nasal] segments, and the degree of velum closing for [-nasal] segments (both consonants and vowels), varies across segment types and contexts. Thus figure 26.2, after Vaissiere (1983), shows the ranges of values covered by nasal consonants as opposed to oral consonants. These values group together all places of articulation, though Vaissiere goes on to show that velum height varies with consonant place. Thus the windows for individual consonants should be somewhat narrower than these ranges of values. For vowels, velum position is more variable than during either oral or nasal consonants. Again, vowels are discussed here as a group, though again different
Figure 26.2 Ranges of values for vertical position of a point on the velum (nominal vertical velum position, one speaker's range) in sentences, based on figures in Vaissiere (1983).
Figure 26.3 Vertical position of a point on the velum in CVN sequences from sentences, after Kent et al. (1974).
vowel phonemes will be expected to have different windows for velum height. The difference between consonants and vowels is traditionally treated as being due to the fact that English consonants, but not vowels, contrast in nasality. Velum position for vowels can vary more because the vowels carry no contrastive value for nasality; velum positions for vowels are more affected by context than are consonant positions. When a vowel precedes a nasal, velum lowering begins at vowel onset, and velum height is interpolated through the vowel, as shown in tracings of velum position in fig. 26.3, after Kent et al. (1974). A window analysis of these facts of English runs as follows. Velum height windows are suggested for oral consonants, nasal consonants, and vowels in figure 26.4. Nasal and oral consonants nearly divide up the range of velum positions, with nasal consonants having lower positions. Vowels have a very wide window, from low to high velum positions, but excluding maximal lowering. Sequences of segmental [nasal] values are represented as sequences of these windows, as in figure 26.4. Contours are derived by tracing smooth paths through these sequences. Vowels, with their wide windows, easily accommodate most interpolations between consonants; any values that would be encountered in interpolating between two oral consonants, or an oral and a nasal consonant, will satisfy the vowel window.
Figure 26.4 Schematic velum height windows, based on data in figure 26.2 and other data from Vaissiere (1983), with contour.
Figure 26.5 Observed raising of velum for vowel between two nasal consonants, after Kent et al. (1974).
The vowel window will, however, exclude a straight-line interpolation between two nasal consonants with maximally open values; to satisfy the vowel window, a slight raising of the velum would be required. Such raising was noted and discussed by Kent et al. (1974); two of their examples are shown in figure 26.5. It is also apparent in Vaissiere's (1983) data. On the window analysis, then, this slight velum raising is the minimally required satisfaction of the vowel's window. This velum raising is better accounted for by the window model than by two previous approaches. In one approach, the vowel would be assumed to have an "oral," raised-velum, target, but that target would be undershot because of constraints on quickly raising and then lowering the velum. The result of the undershoot would be only a slight raising of the velum, as observed. The problem with this account is that in other contexts, especially that of a nasal followed by an oral consonant, much faster velum movements are in fact observed (e.g. Vaissiere 1983). An undershoot approach cannot easily explain these differences in observed speed of movement. In the other approach, vowels are said to be completely unspecified for nasality and therefore impose no requirements on the velum. However, as Kent et al. (1974) noted, the behavior of a vowel between two nasals is a counterexample to this hypothesis. The vowel instead appears to be specified as "at least weakly oral." The window model states exactly this generalization, in a quantitative fashion.
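As an illustration only, the window_path sketch given above reproduces this behavior when run over schematic velum-height windows. The numbers below are invented (loosely patterned on figure 26.4, not taken from Vaissiere's or Kent et al.'s measurements) and are chosen so that the vowel's floor lies above the nasals' ceiling, which is what forces the slight mid-vowel raising.

```python
# Schematic velum-height windows (0 = maximally lowered, 10 = maximally raised).
ORAL  = (7.0, 10.0)   # oral consonants: high and fairly narrow
NASAL = (0.0, 1.0)    # nasal consonants: wide velopharyngeal opening
VOWEL = (2.0, 10.0)   # vowels: wide window, but excluding maximal lowering

nvn = window_path([NASAL, VOWEL, NASAL])   # nasal-vowel-nasal
print(nvn[10:20].min())   # >= 2.0: the vowel portion rises above the nasals
```

An oral-consonant-vowel-nasal sequence, by contrast, satisfies all three windows with a single smooth monotone fall, so no special provision is needed for the faster movements observed in that context.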
26.2.3 Consequences
Wide windows, like the velum window for vowels, have an interesting effect in cases where the two sides of the context have very different values for window width and window placement within the dimension. If B is a segment with a wide window, in a sequence of segments ABC, an interpolated trajectory between almost any A and C will satisfy B's window. Thus B's values, and in fact its entire trajectory, will usually depend completely on the context. Yet these are the sort of cases which have been described as resulting from surface phonetic underspecification (Keating 1985). That is, cases of apparent "underspecification" can be seen as cases of very wide windows. True surface underspecification would be equivalent to a window covering the entire range of possible values. In most cases, though, even a wide window will span something less than the entire range of values, and so in at least some context will reveal its own inherent specification. If a segment appears to be unspecified along some dimension, then it should be examined between identical segments with both extremes of values. One or both of these extremes should show the limits, if any, of the putatively unspecified segment's window. An example of this was seen in the case of English vowels and velum position, where vowels in the context of flanking nasals do not appear as unspecified as they do in other contexts.2 The implication of this point is that, in a window theory, phonetic underspecification is a continuous, not a categorial, notion. The widest windows that produce the apparent lack of any inherent phonetic value are simply one extreme; the other extreme is a window so narrow that contextual variation cannot occur. In a window model, it is possible to assign a target that is equivalent to a single point in space — namely, a maximally narrow window — for segments where this is appropriate. However, not all targets need to be specified this narrowly, and indeed the assumption behind the model is that they should not. Instead the model assumes that most segments cannot be so uniformly characterized. In traditional models, each segment is viewed as having an idealized target or variant that is uncontaminated by contextual influences; context systematically distorts this ideal, and random noise further distorts it unsystematically. (The ideal may be identified with the isolation form, but even there it may be obscured because in practice speakers have difficulty performing in this unusual situation.) The window model stands this view on its head. Context, not idealized isolation, is the natural state of segments, and any single given context reduces, not introduces, variability in a segment. Variability is reduced because the windows provided by the context contribute to determining the path through any one window. However, in most cases there will still be more than one possible path through a given sequence of windows, especially at the edges of the utterance. This
indeterminacy is seen as an advantage of the model; it says that speakers truly have more than one way to say an utterance, especially in cases of minimal context.
26.2.4 Another example: English jaw position
Another example of how the window model works, which shows the effects of narrower windows, is provided by jaw position for consonants in some data of Amerman, Daniloff, and Moll (1970). The fact that jaw position is not directly controlled by any single phonological feature suggests that contextual effects on jaw position are unlikely to be phonological in nature. Jaw positions for vowels depend on (tongue) height features, but for consonants are less directly if at all specified by such features. (For example, the jaw is generally high for coronals, especially alveolars, but is relatively low for velars, though these are usually described as [+high].) Given this difference between vowels and consonants, phoneticians expected effects of vowels on consonant jaw position to extend over long time spans, and looked at data to see just how far such effects could extend. Amerman et al. (1970) looked at the influence of an open vowel /ae/ on jaw position in various numbers of preceding consonants. Because they wanted to look at maximal sequences of consonants in English, the first consonant was almost always / s /. They used the distance between the incisors as observed in X-ray motion pictures as the measure of jaw opening. Data analysis consisted of noting during which consonant the jaw lowering gesture for /ae/ began. In over 90% of their cases, this consonant was the / s /. They concluded that jaw lowering for /ae/ extends over one or two preceding consonants up until an / s /:

Results for jaw lowering show that coarticulation of this gesture definitely extends over two consonants [sic] phonemes preceding the vowel /ae/, probably irrespective of ordinary word/syllable positions. This result could have presumably been extended to four consonants, had the / s / phoneme not shown itself unexpectedly to be contradictory to jaw lowering. [...] coarticulation of a gesture begins immediately after completion of a contradictory gesture.

Amerman et al. did not describe jaw lowering as a feature, or use the term "feature-spreading" to describe coarticulation of jaw position. However, some aspects of their discussion certainly suggest this kind of analysis. They share with feature-spreading analyses two central ideas: first, that coarticulation is the anticipation of an upcoming gesture, and second, that contradictory gestures block this anticipation. In the case at hand, it was expected that a jaw lowering gesture for a vowel would be anticipated during previous consonants, and it was found, surprisingly, that / s / is contradictory to such a lowering gesture. This finding could easily be given a feature-spreading formulation: a low vowel has a feature for low jaw position which will spread to preceding unspecified segments; / s / is
a consonant specified with a feature for a high jaw position which therefore blocks the spreading of the vowel's low position feature. While Amerman et al. did not give such a formulation, Sharf and Ohde (1981), in their review of coarticulation, do: they cite this study, plus Gay (1977), and Sussman, MacNeilage, and Hanson (1973), as showing "feature spreading and shifts in target position" for the jaw. Is jaw position during consonants, then, an anticipation of either a gesture or a feature value for jaw lowering? In the sample figures given by Amerman et al. (as well as in similar figures in Keating 1983) it seems clear that jaw position is continuously changing between the closest position, associated with the entire / s /, and the most open position, associated with the extreme for /ae/. An example is shown in figure 26.6. Any consonants between / s / and /ae/ are affected equally by those two extreme positions; it might as well be said that there is left-to-right carryover assimilation of jaw raising from the / s /, to which /ae/ is contradictory. Even /ae/ shows some effect of the / s /, since much of the /ae/ is spent reaching the extreme open position. In fact, then, both extreme endpoints appear important to the intermediate segments, in that both determine the trajectory from high to low position. Intermediate segments "assimilate" in the sense that they lie along the (curved) interpolation. In these terms, it makes little sense to ask "how many segments" lowering can "coarticulate across." Instead, we want to know which segments provide which extreme values for an articulatory dimension such as jaw lowering. The data presented are not sufficient for a window analysis. Determining the contribution of each segment to the overall contour requires information about the variability of each segment type, yet Amerman et al., with their different experimental goal, examine only one type of segment sequence. From this kind of data we cannot conclude anything about window widths. As it happens, data that address this hypothesis with respect to jaw lowering are available in unpublished work done by me with Bjorn Lindblom and James Lubker. Our experiment recorded jaw position over time in one dimension in VCV tokens. The vowels were /i,e,a/ and were the same in any one token; the consonants were /s,t,d,r,l,n,b,f,k,h/. Although we recorded both English and Swedish speakers, I will discuss only the five English speakers here; each speaker produced each item six times. The measurements made were maximum opening for the vowels, and maximum closing for the consonants, that is, the extreme positions in a VCV. Measuring such extreme positions is relatively straightforward, though it may not represent the full range of possible variation. These data allow us to ask whether a given consonant has a fixed value, or a variable value, between vowels of different degrees of openness. What we find is that in VCV's, both the vowel and consonant extreme jaw positions vary as a function of each other, but vowels vary more than consonants. This result can be stated more generally by saying that overall higher segments (segments whose average position is higher) vary less than overall lower segments.
Figure 26.6 Vertical position of a point on the jaw (nominal jaw position) in /strae/, after Amerman et al. (1970).
Figure 26.7 Range of mean extreme values for jaw position (across vowel contexts, in mm) for four segments, from data of Keating et al. (1987).
Figure 26.8 Sequence of schematic jaw windows for /strae/ (jaw position across vowel contexts, in mm), with contour.
Figure 26.7 shows mean variability of the consonants of interest (/s,t,r/) and the low vowel / a / . For each of these segments in each of its contexts, an average extreme value was calculated across speakers and repetitions. The variability shown here indicates the minimum and maximum averages obtained in this way. From these data, windows for jaw position for each of these segments are proposed in figure 26.8: high and narrow for / s / , similar but slightly wider for /t/, middle and medium for /r/, and low and wide for /ae/. In the figure, the windows are shown in sequence for /strae/, together with a contour traced through these windows. The window for / s / is the narrowest and thus exerts the most influence on the contour. The other consonants have windows that place them along a smooth trajectory from the / s / to the vowel.
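The same window_path sketch can be run over a /strae/-like sequence, again purely for illustration: the window values below are rough guesses in the spirit of figure 26.8, not the measured means, and jaw opening is expressed as interincisor distance in mm, so that larger values mean a lower jaw.

```python
# Hypothetical jaw-opening windows in mm (larger = more open); illustrative only.
S_WIN  = (1.0, 2.0)    # /s/: nearly closed, narrow
T_WIN  = (1.0, 3.0)    # /t/: similar, slightly wider
R_WIN  = (2.0, 6.0)    # /r/: mid position, medium width
AE_WIN = (5.0, 10.0)   # /ae/: open, wide

strae = window_path([S_WIN, T_WIN, R_WIN, AE_WIN])
# Expected shape: a smooth opening movement from /s/ to /ae/, with /t/ and /r/
# lying along the interpolation rather than contributing turning points of their own.
```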
26.3 General discussion and conclusion
The window model is a proposal about how successive segments are accommodated in building a continuous contour along a single articulatory dimension. Two examples have been presented: one in which an articulatory dimension (velum position) is controlled by a single phonological feature ([nasal]), and another in which an articulatory dimension (jaw position) is less directly related to any single phonological feature (e.g. [high]). In this section, some general points about the proposal will be discussed.
26.3.1 Underspecification
The window model extends and refines an earlier proposal about phonetic underspecification. In Keating (1985) I claimed that formant trajectories during / h / are determined by surrounding context, with the / h / contributing no inherent specification of its own, other than the glottal state. I suggested that / h / be analyzed as having no values for oral features even in phonetic representation, with interpolation alone providing the observed trajectories. Under this proposal, phonetic underspecification was viewed as a carryover into surface representation of phonological underspecification, and thus was an all-or-none possibility. Now, however, the degree of phonetic specification is differentiated from phonological underspecification. A wide window specifies relatively little about a segment, while a narrow window gives a precise specification, and all intermediate degrees are possible. Thus, for example, with respect to phonetic nasality, English vowels, with their wide but not maximal window, are "not quite unspecified." Window width is to some extent an idiosyncratic aspect that languages specify about the phonetics of their sounds and features. The default rules and phonetic detail rules of a language will be reflected in window widths. However, it seems likely that, overall, phonetic variability will in part be a function of phonological contrast and specification. Thus it can be proposed that only phonologically unspecified features will result in very wide windows; if the two contrasting values
of a feature were each assigned wide-open windows, the "contrast" could hardly be maintained. Furthermore, window width may derive in part from the specification, or lack of it, of phonological features. It has already been noted that more than one feature may be reflected in a given physical dimension; if a segment is specified for all of the features relating to a particular dimension, then that segment could well have a narrower window for that dimension than a segment which was specified for only one of the relevant features. More generally, the more features a segment is specified for, the more narrow windows it is likely to have for the various dimensions defining the overall phonetic domain.3 Thus the more specified a segment, the smaller its total share of the phonetic domain, so that contrasting segments tend to occupy separate areas in that domain. Another independent consideration that may influence window width is revealed in the data on jaw position variability. Jaw positions that were lower on average were also more variable. Possibly variability may be better measured on a different scale, one using percentages rather than absolute values for position. In any event, to the extent that the dimensions along which variation is measured may not be strictly linear, nonlinguistic constraints on variation may be found.
26.3.2 Variability
The window model expresses the observation that some segments vary more than others along various articulatory dimensions. Previous work has also addressed the issue of segment variability. Bladon and his colleagues (Bladon and Al-Bamerni 1976; Bladon and Nolan 1977) proposed an index of coarticulatory resistance to encode the fact that some segments are relatively insensitive to context, while others vary greatly as a function of context. However, Bladon's index was construed as a separate segmental "feature" with numerical values. That is, a given segment would have its values for the usual phonetic features, plus a value for the coarticulatory resistance feature. Thus the coarticulatory resistance indicated variability around some norm derived from the phonetic feature values. The basic insight of coarticulatory resistance is preserved in the window model: high coarticulatory resistance corresponds directly to a narrow window, and a lack of coarticulatory resistance corresponds directly to a wide window. But in the window model, the variability is not represented separately from observed, modal, or target values. Furthermore, it is values for features, rather than values for unanalyzed segments, which are related to numerical variability. Lindblom (1983) employed a notion related to coarticulatory resistance and general segment incompatibility, called coarticulatory propensity. Using the idea that some segments are more prone to coarticulate with neighbors, Lindblom proposed that segment sequences are generally ordered so as to minimize conflicts due to incompatibility. In effect, then, Lindblom's incompatibility was related to incompatibility as used in feature-spreading models of coarticulation, where
features spread until blocked by segments having the opposite value for the spreading feature or specification for other features incompatible with it. The difference is that Lindblom's blocking was gradient in nature; segments would more or less block coarticulation. However, Lindblom did not quantify his dimension; indeed, it is not obvious how this could be done. More recently, Manuel and Krakow (1984) and Manuel (1987) have discussed variability in terms of what they call output constraints. The idea here is that the size and distribution of a phonemic inventory determines the limits on each phoneme's variability. Phonemes are represented as areas or regions, not as points, in a phonetic domain such as a two-dimensional vowel graph (as also for Schouten and Pols 1979). The output constraints essentially say that no phoneme can intrude into another phoneme's area. Thus the size of the inventory strongly influences the size of each phoneme's area, and its possible contextual variability. In this work, the focus is on variation of an entire segment class within its physical domain, for example, all vowels in the vowel-formant domain. The constraints may be taken as properties of the class more than of the individual segments. Furthermore, no explicit mechanism is given for relating the constraints to the process of phonetic implementation. Manuel follows Tatham (1984) in saying simply that the phonetics "consults" the phonology to ascertain the constraints. The window model, in contrast, gives an account of how the phonetic values derive from the representation of variability. First, window width is related to (though not equated with) feature specification, making variability more a property of individual segments than of segment classes per se. Second, window widths are the basis of path construction, providing variable outputs for combinations of segments and contexts. A hypothesis related to output constraints was discussed by Keating and Huffman (1984) and by Koopmans-van Beinum (1980, and later work), namely, that a language might fill the available vowel domain one way or another - if not with phonemes, then with allophonic variation. These authors place more emphasis than Manuel does on languages allowing extensive overlap among phonemes, that is, not constraining the output very severely. Thus, for example, a five-vowel language like Russian, with extensive vowel allophony and reduction, will show much more overlap among vowels than will five-vowel languages like Japanese. The window model allows this kind of arbitrary variation, since a language can have wider or narrower windows for all feature values. However, as noted above, there is a sense in which inventory will generally affect window widths: inventories affect the degree of phonological feature underspecification, and underspecification probably generally results in wide windows. The output constraints model described by Manuel (1987) differs in another way from the window model. Output constraints are constraints on variability around a target or modal value for each phoneme, and phonemes are seen as having "canonical" variants. In addition, careful speech is hypothesized to involve
production of these canonical variants. The window model includes no such construct, and indeed in earlier discussion the notion of a canonical "isolation form," distorted in contexts, was rejected. Nonetheless, Manuel raises an important issue that future experimentation should address.
26.3.3 Coarticulation
Returning to our initial impetus, how does the window model differ from a targets-and-connections model in providing mechanisms for coarticulation? In the targets-and-connections model, coarticulation at the phonetic level could result from two mechanisms, determination of target values, and interpolation between them. Considering only spatial values, quantitative evaluation as a coarticulatory mechanism is trivialized in the window model. In this model, evaluation consists of looking up the appropriate window. Only narrow windows say anything about the values to be associated with a segment, and turning points in the contour have no special status in characterizing contextual effects on segments. Instead, contour building is done almost entirely by the equivalent of interpolation, which is nearly as powerful in this model as in the old. The hope is that replacing specific rules of context-sensitive evaluation with windows in effect tightens the model and eliminates some potential ambiguities in the targets-and-connections framework.
26.3.4 Conclusion
In this paper some aspects of continuous spatial movements have been examined for two articulatory dimensions, velum and jaw. When segments are combined in sequences, some are more variable, and thus more accommodating to context, than others along a given articulatory dimension. The window model presented here uses those differences in variability as the basis for deriving the continuous spatial movements. In effect, the feature values for segments are translated in the phonetics into more or less specified quantitative values along the articulatory dimensions. There are many aspects of this proposal that obviously require expanded and more precise formulations. More generally, there are many known cases of coarticulation and assimilation that the model must be tested against. What I have tried to do here is to provide a framework and terminology in which contextual variability and continuous representations can be discussed. The proposal is offered at this point with a view to such discussion and future development.
Notes
This work was supported by the NSF under grant BNS 8418580. I would like to thank Abigail Cohn, Carol Fowler, Bruce Hayes, Marie Huffman, Kenneth Stevens, and the editors for their comments.
1 In this paper, windows are discussed only in terms of certain articulatory measurements. However, it seems plausible to construct similar windows from variation in acoustic
measurements, such as formant frequencies, since these are often used as physical representations. Nonetheless, the nonlinearity of articulatory-acoustic and acoustic-perceptual mappings means that variation in one domain will not translate directly into variation in another.
2 I have seen other examples in the acoustic domain, where a particular formant will reveal a "target" value only in extreme contexts. Thus the second formant for intervocalic [s] interpolates between the surrounding vowels, unless both vowels have very low, or very high, second formants. In such cases the [s] is shown to have its own value in the vicinity of 1,700 to 2,000 Hz, depending on the speaker.
3 Such phonetic domains are usually called spaces, e.g. the vowel space. This term is avoided here and later to minimize confusion with the notion of spatial dimension used throughout this paper.
References
Allen, Jonathan, Sheri Hunnicutt, and Dennis Klatt. 1987. From Text to Speech. Cambridge: Cambridge University Press.
Amerman, James, Raymond Daniloff and Kenneth Moll. 1970. Lip and jaw coarticulation for the phoneme /ae/. Journal of Speech and Hearing Research 13: 147-161.
Benguerel, A.-P. and H. Cowan. 1974. Coarticulation of upper lip protrusion in French. Phonetica 30: 41-55.
Bladon, R. A. W. and A. Al-Bamerni. 1976. Coarticulation resistance of English /l/. Journal of Phonetics 4: 135-150.
Bladon, R. A. W. and Francis Nolan. 1977. A videofluorographic investigation of tip and blade alveolars in English. Journal of Phonetics 5: 185-193.
Card, Elizabeth. 1979. A phonetic and phonological study of Arabic emphasis. Ph.D. dissertation, Cornell University.
Fowler, Carol. 1980. Coarticulation and theories of extrinsic timing. Journal of Phonetics 8: 113-133.
Gay, Thomas. 1977. Articulatory movements in VCV sequences. Journal of the Acoustical Society of America 62: 183-193.
Ghazeli, Salem. 1977. Back consonants and backing coarticulation in Arabic. Ph.D. dissertation, University of Texas.
Goldsmith, John. 1976. Autosegmental phonology. Ph.D. dissertation, MIT. Distributed by Indiana University Linguistics Club.
Keating, Patricia. 1983. Comments on the jaw and syllable structure. Journal of Phonetics 11: 401-406.
Keating, Patricia. 1985. Phonological patterns in coarticulation. Paper presented at the LSA Annual Meeting, Seattle. MS in preparation under new title.
Keating, Patricia and Marie Huffman. 1984. Vowel variation in Japanese. Phonetica 41: 191-207.
Keating, Patricia, Bjorn Lindblom, James Lubker, and Jody Kreiman. 1987. Jaw position for vowels and consonants in VCVs. MS in preparation.
Kent, Raymond, Patrick Carney and Larry Severeid. 1974. Velar movement and timing: Evaluation of a model for binary control. Journal of Speech and Hearing Research 17: 470-488.
Kent, Raymond and Frederick Minifie. 1977. Coarticulation in recent speech production models. Journal of Phonetics 5: 115-133.
Koopmans-van Beinum, Floria. 1980. Vowel contrast reduction: An acoustic and perceptual study of Dutch vowels in various speech conditions. Dissertation, University of Amsterdam.
Lindblom, Bjorn. 1983. Economy of speech gestures. In Peter MacNeilage (ed.) The Production of Speech. New York: Springer-Verlag.
MacNeilage, Peter. 1970. Motor control of serial ordering of speech. Psychological Review 77: 182-196.
Manuel, Sharon. 1987. Acoustic and perceptual consequences of vowel-to-vowel coarticulation in three Bantu languages. Ph.D. dissertation, Yale University.
Manuel, Sharon and Rena Krakow. 1984. Universal and language particular aspects of vowel-to-vowel coarticulation. Haskins Laboratories. Status Report on Speech Research SR-77/78: 69-78.
Moll, Kenneth and R. Daniloff. 1971. Investigation of the timing of velar movements during speech. Journal of the Acoustical Society of America 50: 678-694.
Ohman, Sven. 1966. Coarticulation in VCV utterances: spectrographic measurements. Journal of the Acoustical Society of America 39: 151-168.
Ohman, Sven. 1967. Numerical model of coarticulation. Journal of the Acoustical Society of America 41: 310-320.
Pierrehumbert, Janet. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT.
Schouten, M. and L. Pols. 1979. Vowel segments in consonantal contexts: a spectral study of coarticulation - Part I. Journal of Phonetics 7: 1-23.
Sharf, Donald and Ralph Ohde. 1981. Physiologic, acoustic, and perceptual aspects of coarticulation: Implications for the remediation of articulatory disorders. In Norman Lass (ed.) Speech and Language: Advances in Basic Research and Practice. Vol. 5. New York: Academic Press.
Steriade, Donca. 1982. Greek prosodies and the nature of syllabification. Ph.D. dissertation, MIT.
Sussman, Harvey, Peter MacNeilage, and R. Hanson. 1973. Labial and mandibular movement dynamics during the production of bilabial stop consonants: Preliminary observations. Journal of Speech and Hearing Research 16: 397-420.
Tatham, Marcel. 1984. Towards a cognitive phonetics. Journal of Phonetics 12: 37-47.
Vaissiere, Jacqueline. 1983. Prediction of articulatory movement of the velum from phonetic input. MS, Bell Laboratories.
27 Some factors influencing the precision required for articulatory targets: comments on Keating's paper KENNETH N. STEVENS
1 Introduction
The framework that Keating has presented for describing articulatory movements has a number of desirable features. It incorporates the view that the positioning of articulatory structures to implement a particular feature or feature combination can occur within a certain range of values or "window," and that the movements that occur when there is a feature change need only follow trajectories that do not violate these windows. Whether, as Keating suggests, there is no preference for different target positions within a window might be open to debate, since there is at least some evidence that certain acoustic characteristics of clear speech show greater strength or enhancement than those for conversational speech (Picheny, Durlach, & Braida 1986). That is, there appears to be some justification for the notion that the strength of the acoustic correlate of a feature may vary, and one might conclude, then, that some regions within a window (as defined by Keating) are to be preferred over others. In Keating's framework, a target region is characterized not only by a range of values for an articulator but also, through the specification of horizontal parallel lines, by a time span within which the articulatory dimension must remain as it follows a trajectory defined by a feature change. The implication is that an articulatory structure should remain within a target region over a time interval in order to implement adequately the required feature or feature combination. Thus, for example, a narrow window would require an almost stationary position for an articulator over a substantial time span. Examination of articulatory movements suggests, however, that during speech articulatory structures rarely remain in a fixed position for very long, except when two structures are approximated, as for a stop consonant. A possible modification of Keating's proposal would be to specify windows with particular widths, as she suggests, but to require that these windows be reached or passed through and not necessarily maintained over a time interval. Interpolation between targets would be determined through some
smoothing principle derived in part from physiological constraints on movement trajectories. If one accepts the idea that there may be a range of target positions for an articulatory structure when implementing a particular feature combination, then it is natural to ask what factors govern the specification of windows. Keating uses empirical data to arrive at a determination of the windows, but in her paper she hints at some of the factors that may actually determine window size. Two such factors will be touched on here. One of these derives from the fact that, for a given degree of precision in the acoustic (or perceptual) representation of a segment or feature, there is considerable variability in the required degree of precision in the placement of various articulatory structures. The other factor arises as a consequence of underspecification for some features in the representation of segments, such that adequate perceptual distinctiveness can be achieved without tightly constraining the acoustic or articulatory specification of the segments. Keating makes several observations on the apparent window size for velum height for nasal consonants and vowels. She notes that in English velum height appears to show less variability for consonants than for vowels, but that the velum undergoes some raising when a vowel occurs between two nasal consonants. It is known that the velopharyngeal opening that produces a given degree of nasalization of a vowel is dependent on tongue height for the vowel. A relatively small opening gives rise to a certain amount of modification of the spectrum in the first formant region for high vowels, whereas a much larger opening is needed to produce about the same amount of acoustic change for low vowels. Across all vowels, then, the apparent window size for velum height would appear to be large, but for a given vowel it would be much smaller. When a nasal consonant is produced adjacent to a vowel in English, information about the manner and place of articulation of the consonant is carried in the few milliseconds or tens of milliseconds adjacent to the consonantal implosion or release. The required abruptness of the acoustic change at the implosion or release is a consequence of the rapid shift from an acoustic output that is entirely from the nose to an output that is primarily from the mouth opening. For a nasal vowel, the proportion of sound output from nose and mouth depends on the ratio of the area of the velopharyngeal opening to the minimum cross-sectional area of the vocal tract in the oral region. For a rapid acoustic switch to occur at the release of the nasal consonant, the velopharyngeal opening at the instant of the release should not be too great, since otherwise the principal output would continue to be from the nose in the time interval immediately following the release. Some setting of the velopharyngeal opening is desirable, therefore, at the instant of release of the consonant, and this setting would tend to be a reduction from the wider opening that is optimal for the nasal murmur. Acoustic requirements on the consonant release (or implosion), then, would dictate that the velum be raised in a vowel
between nasal consonants. An additional constraint might be that since vowel height is represented in the sound less clearly for a nasal vowel than for a non-nasal vowel, a reduction in the velopharyngeal opening during the vowel will strengthen the implementation of the height feature for the vowel. A number of other examples can be cited for which the range of values that are permitted for an articulatory parameter is determined by acoustic considerations. In the case of vowel configurations, it follows from acoustic theory that the shift in formant frequencies that results from a particular change in the cross-sectional area at some point along the vocal tract depends on the cross-sectional area at that point and on the overall vocal tract shape. Thus, for example, when a uniform vocal tract shape is used to produce a vowel, changing the area of the anterior 1 cm section of the tract (i.e. at the lips) by 1 cm² (from, say, 3 cm² to 4 cm²) will shift the third formant frequency by about 1 percent. On the other hand, for a vowel opening such as /i/, a change in the area of the lip opening of only 0.1 cm² (from, say, 0.5 cm² to 0.6 cm²) also causes a shift of about 1 percent in the third formant frequency. Thus, it is not unexpected that there will be large differences in the effective window size for the mouth opening depending on vowel height. The tongue position for a velar stop consonant is another case where significant variability can be tolerated without influencing appreciably the acoustic requirement of a prominent mid-frequency peak in the burst spectrum. While variation in the position of the point of contact between the tongue and hard or soft palate of 1-2 cm will give rise to bursts with a range of frequencies, the relevant compactness property of the burst remains the same. In the case of a stop consonant produced with the tongue blade, however, the position of the point of contact of the tongue tip with the hard palate must be adjusted to be within a few millimetres of the alveolar ridge in order to produce the appropriate acoustic property. The jaw is another articulatory structure that must be controlled with various degrees of precision, as Keating notes, and again acoustic requirements probably play a role in determining the window size. As noted above, the precision required for a mouth opening (and presumably for tongue height in general) is relatively lax for low vowels, since the influence of change in these parameters on the acoustic output is small. Thus jaw position will have a rather wide window, in Keating's terms, for low vowels. On the other hand, the lower incisors must be positioned relatively accurately for a strident fricative consonant such as /s/ or /ʃ/, since these incisors provide an obstacle against which the jet of air from the tongue-palate constriction impinges. Positioning of the incisors in this way is necessary to generate the required large-amplitude turbulence noise (Amerman, Daniloff and Moll 1970). Reasonable precision of jaw opening is also necessary for American English /r/, at least as produced by most speakers, since the dimensions of the
sublingual space must be adjusted to produce a relatively low natural frequency of the front cavity, leading to the low third formant frequency characteristic of this sound. The examples we have given up to this point have suggested that the size of a window is influenced in part by non-monotonicities in articulatory-acoustic relations; varying degrees of precision in adjusting articulatory structures or states might be required to achieve a particular degree of acoustic (or perceptual) precision. As Keating has indicated, window size for a given articulator may also depend on the degree to which particular features are specified in the underlying representation of the utterance. Some features are distinctive, and hence their implementation may be more tightly constrained, whereas others are redundant, and there may be little or no constraint on their implementation. Predictions of the effects of such redundancy or underspecification on acoustic or articulatory "window size" are needed in order to quantify a model such as Keating's. An initial attempt to produce such predictions for vowels has been made by Manuel and Krakow (1984) and by Manuel (1987). These authors provide evidence to show that, in a language with a relatively sparse vowel system, the formant frequencies for a vowel may be influenced substantially by a vowel in an adjacent syllable, whereas the acoustic manifestation of a similar vowel is more tightly constrained when the language has a rich vowel system. In Keating's framework, the size of the window for the vowel would be different in the two languages, and as a consequence, perceptual separation would be maintained between the vowels. A number of other examples can be given for which underspecification in the feature representation of an utterance leads to variability or lack of precision in an articulatory or acoustic target. A weak voiced fricative such as /v/ or /ð/ in English is often produced in intervocalic position as a sonorant, without appreciable increase of intraoral pressure. This implementation is presumably a consequence of the fact that the feature [sonorant] is not distinctive for these segments. Or, a coronal stop consonant such as /t/ or /d/ preceding a retroflex consonant /r/ can be actualized with a point of closure over a range of points near but immediately posterior to the alveolar ridge. This range of places of articulation is possible because the feature [anterior] is not distinctive for apical stop consonants in English. Just because a particular feature does not appear to play a distinctive role in a particular context does not mean, however, that the acoustic or articulatory manifestation of this feature is free to undergo large variation, or to be characterized by a wide window. Rather than showing wide variation, the acoustic or articulatory manifestation of a feature of this kind may be quite restricted when it is implemented with a particular group of features. This preferred manifestation appears to be selected in such a way that it enhances the implementation of the other features. Thus, for example, vowels are normally produced as non-nasal unless the feature [nasal] is distinctive for vowels in the language. The non-nasal
implementation of vowels strengthens the acoustic manifestation of the features [high] and [low]. Or, non-low back vowels are normally produced with rounding, presumably to enhance the feature [back], although of course in many languages the feature [round] is also used distinctively. Examples such as these suggest that it may be difficult at this time to develop a theory that will predict the precision or window size required for positioning articulatory structures for sequences of segments in an utterance. The concepts of underspecification or redundancy are elusive in their application to speech, and further research is needed to quantify them at the phonetic level. In the meantime, ideas of the kind proposed by Keating provide a framework that guides further research leading to quantitative models of speech production.
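One way to make the lip-area comparison given earlier explicit is to define the window width directly from the acoustic tolerance. The statement below is only a schematic restatement of the numbers already cited, under the simplifying assumption that the formant shift is locally linear in lip area; the 1 percent tolerance is illustrative rather than a claim about perceptual thresholds.
% Schematic only: window width W for lip area A, given a tolerated relative
% shift \varepsilon in F_3 and a locally linear articulatory-acoustic relation.
\[
  W(\varepsilon) \;\approx\; \varepsilon \, F_3 \left| \frac{\partial F_3}{\partial A} \right|^{-1}
\]
\[
  \text{uniform tract: } \frac{\Delta F_3}{F_3}\approx 1\% \text{ for } \Delta A \approx 1\ \mathrm{cm}^2
  \;\Rightarrow\; W \approx 1\ \mathrm{cm}^2;
  \qquad
  \text{/i/-like tract: } \frac{\Delta F_3}{F_3}\approx 1\% \text{ for } \Delta A \approx 0.1\ \mathrm{cm}^2
  \;\Rightarrow\; W \approx 0.1\ \mathrm{cm}^2.
\]
On this reading, an order-of-magnitude difference in required articulatory precision follows from one and the same acoustic (or perceptual) tolerance, which is the sense in which window size is acoustically determined.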
References
Amerman, J. D., R. Daniloff, and K. L. Moll. 1970. Lip and jaw coarticulation for the phoneme /æ/. Journal of Speech and Hearing Research 13: 147-161.
Manuel, S. Y. 1987. Acoustic and perceptual consequences of vowel-to-vowel coarticulation in three Bantu languages. Ph.D. dissertation, Yale University.
Manuel, S. Y. and R. A. Krakow. 1984. Universal and language particular aspects of vowel-to-vowel coarticulation. Haskins Laboratories Status Report on Speech Research SR 77/78: 69-78.
Picheny, M. A., N. I. Durlach, and L. D. Braida. 1986. Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing Research 29: 434-449.
28 Some regularities in speech are not consequences of formal rules: comments on Keating's paper CAROL A. FOWLER
28.1 The general enterprise
I will comment first on the general enterprise that Patricia Keating outlines in her paper and then on her discussion of coarticulation, including specifically her proposals of target "windows" and interpolation mechanisms for generating articulatory contours. Here and elsewhere, Keating suggests that much of phonetics, previously considered outside the domain of a grammar, in fact, deserves incorporation into the grammar as a set of rules: "More and more, the phonetics is being viewed as largely the same sort of creature as the phonology, that is, a formal set of rules." (Keating 1985a: 19) The rules take the nonphysical, symbolic, phonological representation and derive "a more physical representation" (this volume, p. 451). One motivation for this elevation of phonetics has been the failure of universal phonetics. Most phonetic generalizations are found to have exceptions, and therefore, it seems, they are of linguistic interest. Further, Keating suggests (1985a) that study of phonetic regularities reveals some patterns evident in certain languages that resemble phonological patternings in other languages. (For example, whereas many languages show 20-30 ms of acoustic vowel shortening before a voiceless consonant, in some languages including English, the duration difference is much larger and, apparently, has been phonologized as a length difference.) The finding that languages may exhibit more-or-less the same patterning, but at different levels of representation in the language, suggests to Keating that phonetic regularities are "not something divorced from the rest of the grammar, but something controlled as part of the grammar" (1985a: 19). To the extent that the view of phonetics that Keating has put forth spurs serious investigation of phonetic regularities, and their similarities to phonological patterns, I applaud it. If we are to develop realistic theories of phonology then we have to understand phonetic regularities and their relation to phonological ones. However, I have to disagree with, or at least express skepticism concerning, two inferences Keating draws from the non-universality of nonetheless popular
phonetic regularities and from similarities between some phonological and phonetic processes. One inference is that, if a generalization is not universal, it should, therefore, be incorporated into the grammar as a formal rule. The other is that analogous patterning in phonological and phonetic processes implies that the two kinds of rules are "the same sort of creature," best expressed as formal rules. My disagreement stems from three considerations, one relating to the kinds of regularities that belong in grammars, one to the relation between phonology and phonetics, and one relating, finally, to the rich possibilities that articulatory dynamics may offer for understanding some systematicities in speech.
28.1.1 What belongs in a grammar?
In a recent paper, referring to Pierrehumbert's (1980) intonational rules of "evaluation" (context-sensitive target assignment), Keating writes that "such phonetic rules are linguistic, not physical, in nature, and so should be considered part of the grammar" (1985a: 21). For reasons I will outline under section 28.1.2, I do not think "linguistic" and "physical" are descriptions of the same type (any more than are "in a flood of tears" and "in a sedan chair" in Ryle's example of a category mistake [1949: 22]: "She came home in a flood of tears and a sedan chair") that can be contrasted as they are in Keating's comment above. However, the other point, that the grammar should include description only of those regularities that are linguistic (and not due to articulatory dispositions of the vocal tract, for example) seems quite right. More than that, in my view, the only grammars worth trying to describe are grammars that capture the prototypical practical linguistic knowledge of language users. A grammar for the lower echelon of the dual patterning in languages (Hockett 1960), then, should include generalizations that express more-or-less the patternings that a talker has to exhibit by means of vocal-tract gestures, in order for a listener to recover a linguistic message from them. Moreover, the linguistic patterns presumably are the ones the talker has to impose on the vocal tract - that is, they are what a talker has to do "on purpose," because only voluntarily imposed patterns can be used, voluntarily, to convey linguistic information. But then we must be very careful to distinguish what talkers do on purpose from what just happens anyway. I will suggest in section 28.1.3 that the existence of exceptional languages that do not show an otherwise popular systematic patterning is not a sufficient criterion for elevating the popular property into the grammars of languages that do show it. The popular pattern may not be controlled even if the exceptional one is.
28.1.2 The relation of phonology to phonetics and articulation
In one respect, Keating's view of the relation of phonology to phonetics and, ultimately, to articulation is conventional. From the talker's perspective, a
linguistic message-to-be starts out as a linguistic, nonphysical, description that ultimately, in her words, is made "more physical." Abstract, discrete, static, serially-ordered segments become complex, interwoven, dynamical gestures of the vocal tract. I see the relation of phonological descriptions to utterances differently. In fact, I think that there is something impossible about the familiar view that linguistic utterances start out nonphysical and somehow become physical later. Rules that take a description in one domain (e.g. phonological) into another description in another domain (e.g. phonetic) are linguistic things; they are statements composed of symbols. We can have rules that take symbols into other symbols, but we can't have a rule that takes a symbol into a thing (that is, into something physical). Of course we can transform a symbol that does not refer to something physical into a symbol that does. But how can we ever, rulefully, leave the realm of symbols and get the vocal tract moving? I don't think that we can. (I have the same objection to psycholinguistic theories of language production such as MacKay's [1982], that terminate in "motor commands." Commands are linguistic things, but the neurons supposedly obeying the commands are not in the command-understanding business.) How does a speech plan ever make something happen in the vocal tract? We need to find another way to understand the relation of linguistic patterns to articulatory activity. I suggest the following alternative. Linguistic structure, phonological structure in particular, is a patterning that, during speech, is realized in the vocal tract. A talker's task can be seen as one of achieving that realization. If talkers plan utterances, as many sources of evidence suggest that they must do, then the linguistic structure to be realized as vocal-tract activity is planned, perhaps in the talker's brain. The brain, like the vocal tract, is a physical system capable of embodying structure. In order to realize a speech plan as vocal-tract activity, there is no need for the patterning to be made more physical. Rather, the patterning in one physical medium has to be replicated in another, presumably via causal connections between the two media. Thus, transformations involved in realizing a speech plan are largely "horizontal" ones in which the levels of abstraction of linguistic structure in the plan are maintained in the vocal tract. There are no vertical transformations of a sort that take abstract, nonphysical linguistic patterns into something physical. Accordingly, there is no need for rules to achieve those kinds of vertical transformations. In short, I propose that a phonological segment is an abstract component of a linguistic utterance that is always physically instantiated. Its natural home is the vocal tract. A spoken phonological segment is two kinds of thing at once that participates in two kinds of events simultaneously - the one a linguistic communicative event, and the other an articulatory one. For the first event in which a phonological segment participates - the linguistic communicative one - only certain of its properties are relevant. They are the
abstract systematic patternings used by the linguistic community for communicative purposes. For example, to do its work in constituting part of a lexical item, a phonological segment must have a particular gestural constituency (captured in generative phonology by binary feature values, but soon to be captured better, I think, by the abstract gestures in Browman and Goldstein's articulatory phonology, 1986 and this volume). In addition, the gestures of the segment must be mutually cohesive and separable from the gesture complexes of other segments of the word. Cohesion and separation are achieved by serial ordering of feature columns, one column per segment, in generative phonology. These properties are abstract, but not in the sense that they exist only in the minds of language users. Instead, they are evident in articulation but only from a perspective on the activity that is abstracted from the detailed articulatory dynamics. For the second event in which a phonological segment will participate, an articulatory event in a vocal tract, of course the detailed dynamical properties of the segment are relevant. They are the properties that instantiate the gestural constituency of the segment, the cohesion among the gestures of a segment and their separability from gestures of other segments by means of coarticulatory sequencing. By hypothesis, the relation between the two events is hierarchical (cf. Pattee 1973). The phonological patterning serves as a set of control constraints harnessing the articulatory dynamics to its ends.1 Where does phonetics fall in a system in which phonological patterns serve as abstract constraints on articulatory dynamics? That is, are phonetic generalizations regular processes that talkers impose on activity of the vocal tract to express a linguistic message, or are they reflections of articulatory dynamics per se? Quite likely, there is no general answer to this question. I suggest, however, that we not rule out the second answer in particular instances either on grounds that a phonetic generalization is not universal, or on grounds that some phonological and phonetic patterns are similar and so, by inference, are patternings of the same type, appropriately expressed by formal rules. Consider shortening of a vowel before a voiceless consonant. A great number of languages exhibit this systematic property (Chen 1970); some do not (Keating 1985b). Data presented by Chen (1970) show that the 20-30 ms of shortening that many languages show probably is due to the more rapid closing gesture for a voiceless stop as compared to a voiced stop. Chen relates that, in turn, to the greater muscular force needed to maintain closure of a voiceless stop given the higher intraoral pressure for voiceless than voiced stops. Imagine that Chen's account is correct. Then we do not need a formal phonetic rule to realize the 20-30 ms of vowel shortening in languages that exhibit the property. We may need something in languages that counteract the effects of the consonant, presumably by lengthening the vowel. But there is no need for a rule to realize the "default" condition because it would happen anyway without such a rule.
What of languages such as English in which the durational difference is much too large to be explained by Chen's account? And what, more generally, of the analogies we see between certain phonological and phonetic regularities, including the phonological and phonetic vowel shortening before voiceless consonants just considered, compensatory lengthening (a phonological regularity described, for example, by Hayes, paper presented at the Laboratory Phonology conference) and compensatory shortening (a complementary phonetic regularity; see, e.g. Fowler 1981a, 1983), vowel harmony and vowel-to-vowel coarticulation, etc.? Do the similar patternings of these phonological and phonetic regularities imply that they are "the same sort of creature"? I don't think so. The phonological processes, of course, are grammatical; however, I suspect that, generally, the analogous phonetic ones are not. The similarities between the processes may provide information concerning ways in which certain phonological processes enter the language (cf. Ohala 1981). Consider compensatory lengthening and shortening, for example. Compensatory lengthening is a real change in the phonological length of a segment, say a vowel, when another segment, often a consonant or a schwa, is somehow lost. The increase in phonological length is achieved by real articulatory lengthening of the vowel. For its part, compensatory shortening is a reduction in the measured acoustic duration of a vowel when a consonant or unstressed syllable is added to its neighborhood. I doubt that this measured acoustic shortening is realized by any real articulatory shortening of the vowel, however (see Fowler, Munhall, Saltzman, and Hawkins 1986, in which vowel shortening with added consonants is examined). The measurable acoustic duration of the vowel shortens only because some of its articulatory trajectory now coincides with that for the neighboring consonant or unstressed vowel. Whereas in the absence of the consonant or unstressed vowel, that part of the vowel counts as a measurable part of the vowel, now it is classified as "coarticulation" and is measured as part of the consonant. What is the relation between compensatory lengthening and compensatory shortening then? They are strikingly similar. Hayes (Laboratory Phonology conference paper) finds an asymmetry in contexts for compensatory lengthening. It never occurs when a consonant or schwa preceding a target vowel is lost, but only when following consonants or schwas are lost. He also finds that compensatory lengthening of a vowel can be triggered by loss of a nonadjacent following consonant. Both effects are regular properties of compensatory shortening. The measurable acoustic durations of vowels shorten substantially when consonants are added to a syllable rhyme (Fowler 1983) or when an unstressed syllable is added to a foot (Fowler 1981a). They shorten little when consonants are added to a syllable onset (Fowler 1983), and little or not at all when unstressed syllables are added before a stressed vowel (Fowler 1981a). Moreover, measured vowel shortening occurs if a consonant is added to a VC syllable, even if it is not adjacent to the vowel.
I suspect that the similarity occurs because some phonological processes enter the language when language learners misinterpret acoustic consequences of systematic articulatory behaviors as reflecting some grammatical property - that is, as intended productions by a talker. (This is, approximately, Ohala's view [e.g. 1981] that listeners may be sources of sound change in language.) In the case of compensatory lengthening, language learners may misinterpret the acoustic consequences of a loss of coarticulatory overlap, when a segment is lost or weakened, as true lengthening (rather than as uncovering) of the segment. When they produce the lexical items themselves, they produce truly lengthened segments. If the language community tolerates these errors - perhaps because they add redundancy to lexical forms - they enter the language as phonological processes. From this perspective, then, the parallels between some phonological processes and some phonetic generalizations do not signify that the two kinds of pattern are the "same sort of creature." They only tell us how the phonological processes may have entered the grammar.
28.1.3 The richness of articulatory dynamics
I can detect in Keating's discussion of what is "physical" in language production a prejudice about it that I think is mistaken, but widely accepted in linguistic theorizing. It is that, because the vocal tract is a physical system that is the same for speakers of all languages, the only kinds of systematicities for which it can be responsible must be universal and not of linguistic interest. But that is not quite right. In speech production, talkers' vocal tracts are organized for speech - and in particular, for whatever phonological segments are being produced. The organization is phonological. To produce a recognizable linguistic message, talkers have to realize phonological segments and processes in activity of the vocal tract. The organizations are different for different segments and so they are also different for speakers of different languages that have different segment inventories. Those organizations themselves will generate their own systematic consequences in the articulatory behaviors of talkers. For example, consider bilabial stop production. Over different vocalic environments, the bilabial constriction is achieved with different contributions of the jaw and lips (e.g. Keating, Lindblom and Lubker, cited in Keating, this volume). Specifically, the jaw will do less and the lips more for bilabial stops coarticulating with low vowels than for the same stops coarticulating with higher vowels. The negative relation between contributions of the jaw and lips is beautifully systematic and could be expressed by formal rule, but that would miss the point as to why that relation shows up. The phonology specifies a gesture (constriction at the lips) that is realized in the vocal tract by means of a set of very low-level physiological biasings in the vocal tract.
These biasings create, temporarily, an organization among relevant articulators (lips and jaw) with the remarkable property of "equifinality" - that is, a tendency to reach a relatively invariant end (bilabial closure) by a variety of routes. That equifinality is achieved far down in the nervous system and musculature is shown by perturbation experiments in which the jaw is suddenly braked (by means of a torque motor) while raising for bilabial closure. Within 20-30 ms of the perturbation, the upper lip responds with increased lowering (Kelso, Tuller, Vatikiotis-Bateson, and Fowler 1984). Talkers are not applying a rule that sums jaw and lip closing movements; the response to perturbation is much too fast for that. Rather, talkers have implemented an organization of the musculature that serves as a "special-purpose device" to realize a gestural requirement of bilabial-stop production. The general point is that some systematic properties we observe in speech are impositions by talkers so that their vocal tract activity can convey a linguistic message. Others may be more-or-less automatic, more-or-less necessary consequences of those impositions. Given an organization of the vocal tract to realize some phonological properties, these systematicities will occur as a consequence. Representing them in the grammar is undesirable both because, given a vocal tract organized for speech, they are fully redundant with the systematicities that the grammar must express, and because they misrepresent the origin in the vocal tract itself of these systematic properties. Notice too that these automatic ("dispositional") consequences of vocal-tract organization will depend on how the vocal tract is organized. Because the organizations realize the phonological segments of a particular language, the dispositional systematicities to which they give rise need not be universal. Therefore, we cannot use universality as a necessary criterion to identify contributions of the articulatory system. Rather, just as Keating's work promotes looking closely at the phonetics to better understand its role in language, we must also look more closely at articulatory dynamics in speech to see what kinds of things talkers control and which ones just happen given the controls.
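The speed of the compensation is what rules out a symbolic rule summing jaw and lip movements; it falls out instead of defining the control law over the task variable itself. The sketch below is a minimal illustration of that idea only, not Kelso et al.'s experiment or Saltzman's task-dynamic model: a critically damped point attractor is defined on lip aperture, its motion is shared out over a jaw channel and an upper-lip channel, and clamping the jaw partway through simply leaves the whole correction to the lip. All parameter values and the 60/40 sharing weights are invented for the illustration.

# Illustrative sketch only (hypothetical values): a critically damped point
# attractor on the task variable "lip aperture," distributed over two
# articulatory channels. Clamping the jaw does not disturb the task-level goal.

def simulate(brake_jaw_at=None, dt=0.001, dur=0.20, k=900.0):
    b = 2 * k ** 0.5          # critical damping for the task variable
    jaw, upper = 0.0, 10.0    # mm; aperture = upper - jaw
    vel = 0.0                 # task-level velocity of the aperture
    trace = []
    for i in range(int(dur / dt)):
        t = i * dt
        aperture = upper - jaw
        acc = -k * aperture - b * vel      # drive the aperture toward 0 (closure)
        vel += acc * dt
        d_aperture = vel * dt
        if brake_jaw_at is None or t < brake_jaw_at:
            jaw += 0.6 * (-d_aperture)     # jaw closes from below
            upper += 0.4 * d_aperture      # upper lip closes from above
        else:
            upper += d_aperture            # jaw clamped: the lip absorbs it all
        trace.append((t, jaw, upper, upper - jaw))
    return trace

normal = simulate()
perturbed = simulate(brake_jaw_at=0.05)    # "brake" the jaw 50 ms into the movement
print("final aperture, normal run:    %.2f mm" % normal[-1][3])
print("final aperture, perturbed run: %.2f mm" % perturbed[-1][3])
print("jaw contribution, normal vs perturbed: %.2f vs %.2f mm" % (normal[-1][1], perturbed[-1][1]))

Both runs end with essentially the same (near-zero) aperture but with very different mixes of jaw and lip contribution: the invariant is the task goal, not any single articulator's position, and no articulator-summing rule is ever consulted.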
28.2 Coarticulation and assimilation
If, as I have suggested, there are two general sources of systematicities in speech, the one grammatical - somehow imposed by the talker - and the other byproducts of those impositions, how can we tell which is which? The best way, clearly, as just noted, is to study articulatory control. Another approach, adopted by Saltzman (1986; Saltzman and Kelso 1987), is to model articulatory organizations to see what they can do to generate some observed systematic properties of speech behavior. Before we decide that a systematic property of speech deserves expression in the grammar as a formal rule, we should try to rule
out, as an alternative, that it is not directly controlled by talkers but rather is a consequence of articulatory organizations for speech. A shorter, more rule-of-thumb approach than either of these is perhaps to ask the listener whether a property is grammatical. As I suggested above, following Ohala (1981), some phonological processes may leak into languages via misperception by listeners. They attend to the acoustic consequences of dispositional articulatory regularities as if they were consequences of regularities imposed on articulation intentionally. When misperception does not happen, however, in some respects, listeners are remarkably insensitive to the acoustic consequences of dispositional regularities. But they must be sensitive to the consequences of grammatical processes. For example, no matter how hard we concentrate, we cannot make ourselves aware of the acoustic differences between the "intrinsic allophones" (Wang and Fillmore 1961) of /d/ in /di/ and /du/, even though research shows that we use those context-sensitive parts of the acoustic signal to identify consonantal place (e.g. Blumstein, Isaacs, and Mertus 1982; Walley and Carrell 1983). In contrast, we can be made aware of the durational differences between the vowels in "heat" and "heed." Ordinarily, listeners hear through the details of coarticulatory sequencing to their linguistic significance - that is, to the serially ordered activations of organizations of the vocal tract that constitute the sequences of phonological segments being produced by the talker. On this criterion, I suspect that many processes identified as coarticulatory will count as nongrammatical, whereas true assimilations are grammatical. Consider first coarticulation, and vowel-to-vowel coarticulation in particular. During a medial consonant of a VCV, gestures of the tongue body move from a posture for the first vowel toward a posture for the second (Barry and Kuenzel 1975; Butcher and Weiher 1976; Carney and Moll 1971; Öhman 1966; Perkell 1969). Acoustic and articulatory evidence also suggests that articulatory movement for the second vowel may begin during the first vowel (Bell-Berti and Harris 1976; Fowler 1981a, 1981b; Manuel and Krakow 1984). The acoustic effects on the first vowel appear assimilatory to a speech researcher because, for example, the second formant for the first vowel is displaced toward that for the second vowel. However, the effects are not assimilations. As Keating described for phonetic coarticulation generally, the effects are gradient and may only affect part of the first vowel. Furthermore, listeners to the speech do not hear the coarticulatory effects as assimilatory; they hear them as talkers produce them - as gestures for the second vowel. In /ibəbi/ and /abəba/, acoustic signals in the time frame of the schwas are quite distinct (Fowler 1981a, 1981b). Pulled from context, the schwa in /bə/ produced in the context of the high front vowels sounds high itself, like /bɪ/, and listeners distinguish it readily from schwa produced in the context of /a/. In context,
however, the syllables sound alike, and both sound like neutral /bə/. The reason they sound alike, however, is not that the context somehow masks the coarticulatory effects on schwa. Listeners are sensitive to them (Fowler and Smith 1986). Asked to identify the final vowels of /Vbəbi/ and /Vbəba/ as "ee" or "ah," listeners respond faster if the coarticulatory information in the preceding schwa is predictive of the final vowel than if it is misleading (due to cross-splicing). Listeners use the coarticulatory effects of a context vowel on schwa as information for identifying the coarticulating vowel. On the other hand, they do not integrate this information with information for the reduced vowel whose gestures co-occur with it in time. For listeners, that is, coarticulatory effects are not assimilatory; rather they signal the talkers' serial ordering of neighboring phonological segments. This contrasts with true assimilations, such as vowel harmony, for example. In vowel harmony, a vowel in a morpheme appears in different surface forms depending on its vocalic neighborhood. The form that appears is more similar on certain dimensions to its neighboring vowels than are the other forms in which the vowel may appear in other neighborhoods. Obviously, native listeners must be able to hear the differences in the vowels that appear in different contexts, because they are different phonological segments of the language. I suggest that this difference in listeners' perceptions of coarticulation and assimilation reflects a real difference in the phenomena being described. The one is a dynamic articulatory process that (for the most part, see below) just happens when phonological segments are serially ordered in speech; the other is a grammatical process under intentional control by a talker and, hence, available to introspection by a listener. What of cross-language differences in coarticulation? Possibly these are grammatical differences among languages. Left unchecked, processes for sequencing the organizations of the vocal tract for successive phonological segments may allow extensive overlap. Languages may curtail overlap among vowels where vowel spaces are crowded (e.g. Manuel and Krakow 1984) because, as previously noted, listeners are not perfectly reliable in hearing through coarticulatory overlap. Similarly, they may curtail transconsonantal vowel-to-vowel coarticulation where it might impair perceptual recovery of essential gestural information for a consonant (see Keating's examples of vowel-to-vowel coarticulation in Russian versus English [1985a] and Recasens' investigations of vowel-to-vowel and vowel-to-consonant coarticulation in Catalan [1984a, 1984b]). Donegan and Stampe (1979) have suggested that phonological processes are those natural dispositions that languages selectively fail to suppress. I am suggesting an almost complementary idea, that systematic behaviors that would occur without explicit intentional control are not grammatical. However, language-specific inhibitions of those dispositional behaviors are.
28.3 Windows and contour building
Because I view coarticulation as a dynamic, fundamentally nongrammatical process, I am inclined to look for another way to explain the phenomena that led Keating to propose the ideas of windows and an interpolation mechanism to generate articulatory contours. The concept of windows focuses on the differential variability that an articulator may show in contributing to the realization of a phonological segment as vocal tract activity. From a slightly more global perspective on vocal tract activity, that variability can be seen in the context of a remarkable goal-invariance produced by the cooperative behavior of multiple articulators. In the example of bilabial-stop production considered earlier, the positioning of the jaw during closure for a bilabial stop is variable across different vocalic contexts. It can be variable only because, thanks to the special-purpose device previously described, the lips can "compensate" for perturbations to the jaw caused by gestures for coarticulating segments that also use the jaw. In instances in which articulators are coordinated in this way, focusing on the individual articulators is focusing too far down in the organized articulatory system to understand why articulators adopt variable positions for the same segment in different contexts. In Saltzman's (1986) task-dynamic model, gestural goals for bilabial stop production (its phonological description in vocal tract organizational terms) do not specify the particular contributions that relevant articulators will make. Rather, the goal for constriction, for example, specifies an organization among the jaw and lips that is flexible except in respect to the goal of bilabial closure. The system can achieve closure by a variety of combinations of contributions from the jaw and lips. In a real vocal tract (but not yet in task dynamics), the combination realized in a particular context will depend in part on demands made on the articulators by coarticulating segments. So a low vowel will lower the jaw during bilabial closure, causing the lips to do more of the work of closing than they do in the context of a high vowel. The apparent size of an articulatory "window" will be due largely to the possibilities for compensation that the vocal tract organization offers for perturbations by other articulators. Possibly, for /s/, the jaw has to do most of the work in achieving the appropriate constriction because there is little or no satisfactory compensation possible from other articulators. Cross-language differences in apparent window size (for example, differences between English and French in velum lowering during oral vowels) presumably are differences in the phonological specification of the segments. Whereas velum lowering may be part of the phonological specification for a French oral vowel, perhaps it is not for an English vowel. The findings that Keating summarizes do
not convince me that velum posture is controlled in English vowel production. That the velum raises in an English NVN sequence does not mean in itself that a window for the velum has been specified as part of the phonological specification for a vowel. Perkell (1969) and others describe a speech "posture" of the vocal tract that talkers adopt before beginning to speak. I have suggested that the posture serves as a sort of speech-specific rest posture of the vocal tract and respiratory system (Fowler 1977). It includes raising the velum from its posture during rest breathing to a position still more open than its postures for oral consonants. Possibly the velum migrates back to this posture where organizational constraints constituting the nasals weaken. They would be weakest in mid-vowel. As for the interpolation mechanism, once again, the question from my perspective is whether articulatory contours between segments are regulated at all or, alternatively, whether they can be seen as systematic consequences of other controls implemented during speech. We know that gestural components of phonological segments (such as bilabial closure) are realized by cooperative organizations among articulators (e.g. Kelso et al. 1984). We presume that organizations for the gestures of different phonological segments are realized in overlapping time frames, giving rise to coarticulation. I have speculated that the activation of a set of organizational relationships among articulators first grows and weakens so that, for example, the influence that a vowel exerts on the jaw is greater during opening for the vowel than during achievement of the constriction for a neighboring consonant. I would speculate further that there is no interpolation of articulatory contours between segments even though effects of the vowel on the jaw can be seen there. Indeed, if the organizations among articulators to achieve gestural goals as proposed by the task-dynamics model and as evidenced in the perturbation study by Kelso et al. for bilabial closure are realistic, then in many instances there can be no mechanism for interpolating demands on an articulator by neighboring segments because individual articulators are rarely individually controlled. Instead, most likely, contouring happens as a by-product of overlapping activations of organizations of the vocal tract that make demands on some of the same articulators.
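If the apparent window is a by-product of compensation, then its width should be recoverable from nothing more than the closure goal plus the context-varying demands on the jaw. A toy calculation along those lines is sketched below; all numbers are invented for the illustration and make no claim about real jaw displacements. The lip-aperture goal is met in every vowel context, and the jaw's "window" is simply the spread of the positions that the contexts leave it in.

# Hypothetical numbers, purely to illustrate the argument of this section.
closing_distance_mm = 12.0
# Jaw raising available in each vowel context (invented): a low vowel holds
# the jaw down, so the jaw contributes less to bilabial closure.
jaw_raising_by_vowel = {"i": 9.0, "e": 7.0, "a": 3.0}

for vowel, jaw in jaw_raising_by_vowel.items():
    lips = closing_distance_mm - jaw                 # lips absorb the remainder
    aperture = closing_distance_mm - (jaw + lips)    # residual aperture at "closure"
    print(f"context /{vowel}/: jaw {jaw:.1f} mm, lips {lips:.1f} mm, residual aperture {aperture:.1f} mm")

window = max(jaw_raising_by_vowel.values()) - min(jaw_raising_by_vowel.values())
print(f"apparent jaw 'window' across contexts: {window:.1f} mm (closure goal met in every context)")

Nothing in this sketch specifies a target range for the jaw; the 6 mm "window" that the output reports is entirely derivative of the invariant closure goal and the contextual demands, which is the sense in which the window would not need to be stated in the grammar.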
28.4 Summary and conclusions
I have argued that we need to develop our understanding of articulatory organization during speech and to use that understanding to limit the rules we propose for the phonological and phonetic components of the grammar. We need that understanding, in particular, if we are to figure out what properly to do with systematic phonetic properties of the language in a grammar. Keating is quite right to reject an earlier view that phonetic generalizations will be universal because vocal tracts are about the same everywhere. However, in my opinion, she rejects the view in part for the wrong reasons. She points out that most phonetic
generalizations have exceptions and so the assumption that generalizations will be universal is wrong. She also argues that, because phonological processes in some languages resemble certain systematic phonetic properties of others, phonological and phonetic generalizations are of the same sort and should be characterized in the same way by formal rules. I have tried to suggest that physical explanations for systematicities in speech will not be limited to a few universal consequences of the fixed structure of the vocal tract. The vocal tract is organized for speech and many of its systematic properties may be consequences of those organizations. The organizations can be language-specific. Moreover, for those phonetic generalizations that are popular across languages, but have exceptions, quite possibly, the "default" behavior is uncontrolled, and it is only in the exceptional languages that a grammatical rule need be invoked. Generally, I would recommend assigning grammatical status to phonetic generalizations only if there is positive evidence that the regularity has to be imposed intentionally by a talker.
Notes
Preparation of this manuscript was supported by NICHD grant HD01994 and NINCDS grant NS13617 to Haskins Laboratories and by a fellowship from the John Simon Guggenheim Foundation. I thank Catherine Browman, Elliot Saltzman and George Wolford for comments on an earlier draft of the manuscript.
1 A useful guide to understanding the kind of relationship that I propose holds between phonologies and articulation is Pattee's characterization of biological systems generally (e.g. Pattee 1973, 1976). Pattee attempts to explain how biological systems can at once be physical systems "obeying" physical laws, just like anything else, and yet at the same time they can embody abstract constraints that harness the physical dynamics in the service of biological and psychological functions. My guess is that the phonology serves as a set of control constraints on articulatory dynamics. Accordingly, the relation of the phonology to more physical descriptions is not that of a prior, abstract, nonphysical description from which the articulatory dynamics are derived by rule. Instead, it serves as ongoing constraints on articulatory activity that ensure that the activity is linguistically patterned.
References
Barry, W. and H. Kuenzel. 1975. Co-articulatory airflow characteristics of intervocalic voiceless plosives. Journal of Phonetics 3: 263-282.
Bell-Berti, F. and K. Harris. 1976. Some aspects of coarticulation. Haskins Laboratories Status Reports on Speech Research 45/46: 197-204.
Blumstein, S., E. Isaacs, and J. Mertus. 1982. The role of gross spectral shape as a perceptual cue to place of articulation in stop consonants. Journal of the Acoustical Society of America 72: 43-50.
Browman, C. and L. Goldstein. 1986. Towards an articulatory phonology. In C. Ewan and J. Anderson (eds.) Phonology Yearbook 3: 219-254.
Browman, C. and L. Goldstein. (this volume) Tiers in articulatory phonology, with some implications for casual speech.
Butcher, A. and E. Weiher. 1976. An electropalatographic investigation of co-articulation in VCV sequences. Journal of Phonetics 4: 59-74.
Carney, P. and K. Moll. 1971. A cinefluorographic investigation of fricative-consonant vowel coarticulation. Phonetica 23: 193-201.
Chen, M. 1970. Vowel length variation as a function of the voicing of the consonant environment. Phonetica 22: 129-159.
Donegan, P. and D. Stampe. 1979. The study of natural phonology. In D. Dinnsen (ed.) Current Approaches to Phonological Theory. Bloomington, Indiana: Indiana University Press, 126-173.
Fowler, C. 1977. Timing Control in Speech Production. Bloomington, Indiana: Indiana University Linguistics Club.
1981a. A relationship between coarticulation and compensatory shortening. Phonetica 35-50.
1981b. Production and perception of coarticulation among stressed and unstressed vowels. Journal of Speech and Hearing Research 127-139.
1983. Converging sources of evidence for spoken and perceived rhythms of speech: Cyclic production of vowels in sequences of monosyllabic stress feet. Journal of Experimental Psychology: General 112: 386-412.
Fowler, C., K. Munhall, E. Saltzman, and S. Hawkins. 1986. Acoustic and articulatory evidence for consonant-vowel interactions. Paper presented at the 112th Meeting of the Acoustical Society.
Fowler, C. and M. Smith. 1986. Speech perception as "vector analysis": An approach to the problems of segmentation and invariance. In J. Perkell and D. Klatt (eds.) Invariance and Variability of Speech Processes. Hillsdale, NJ: Lawrence Erlbaum Associates, 123-135.
Hayes, B. Paper presented at the Laboratory Phonology Conference.
Hockett, C. 1960. The origin of speech. Scientific American 203: 89-96.
Keating, P. 1985a. The phonology-phonetics interface. UCLA Working Papers in Phonetics 62: 14-33.
1985b. Universal phonetics and the organization of grammars. In V. Fromkin (ed.) Phonetic Linguistics: Essays in Honor of Peter Ladefoged. Orlando: Academic Press, 115-132.
(this volume) The window model of coarticulation: Articulatory evidence.
Kelso, J. A. S., B. Tuller, E. Vatikiotis-Bateson, and C. Fowler. 1984. Functionally specific articulatory cooperation following jaw perturbations during speech: Evidence for coordinative structures. Journal of Experimental Psychology: Human Perception and Performance 10: 812-832.
MacKay, D. 1982. The problems of flexibility, fluency and speed-accuracy tradeoff. Psychological Review 89: 483-506.
Manuel, S. and R. Krakow. 1984. Universal and language particular aspects of vowel-to-vowel coarticulation. Haskins Laboratories Status Reports on Speech Research 77/78: 69-78.
Ohala, J. 1981. The listener as a source of sound change. In C. S. Marek, R. A. Hendrick and M. F. Miller (eds.) Papers from a Parasession on Language and Behavior. Chicago: Chicago Linguistic Society, 178-203.
Öhman, S. 1966. Coarticulation in VCV utterances: spectrographic evidence. Journal of the Acoustical Society of America 39: 151-168.
Pattee, H. H. 1973. The physical basis and origin of hierarchical control. In H. H.
Pattee (ed.) Hierarchy Theory: The Challenge of Complex Systems. New York: Braziller, 71-108.
Pattee, H. H. 1976. Physical theories of biological coordination. In M. Grene and E. Mendelsohn (eds.) Topics in the Philosophy of Biology. Dordrecht-Holland: Reidel, 153-173.
Perkell, J. 1969. Physiology of Speech Production: Results and Implications of a Quantitative Cineradiographic Study. Cambridge, MA: MIT Press.
Pierrehumbert, J. 1980. The phonology and phonetics of English intonation. Ph.D. dissertation, MIT.
Recasens, D. 1984a. V-to-C coarticulation in Catalan VCV sequences: An articulatory and acoustic study. Journal of Phonetics 12: 61-73.
1984b. V-to-V coarticulation in Catalan VCV sequences. Journal of the Acoustical Society of America 76: 1624-1635.
Ryle, G. 1949. The Concept of Mind. New York: Barnes and Noble.
Saltzman, E. 1986. Task dynamic coordination of the speech articulators: A preliminary model. In H. Heuer and C. Fromm (eds.) Generation and Modulation of Action Patterns. Experimental Brain Research Series 15. New York: Springer-Verlag, 129-144.
Saltzman, E. and J. A. S. Kelso. 1987. Skilled actions: A task dynamic approach. Psychological Review 94: 84-106.
Walley, A. and T. Carrell. 1983. Onset spectra and formant transitions in the adult's and child's perception of place in stop consonants. Journal of the Acoustical Society of America 73: 1011-1022.
Wang, W. and C. Fillmore. 1961. Intrinsic cues and consonant perception. Journal of Speech and Hearing Research 4: 130-136.
Index of names
Abercrombie, D., 201, 206 Abramson, A., 147, 370, 408 Aissen, J., 286, 290, 292 Al-Bamerni, A., 466 Allen, J., 277, 454 Allen, W., 326, 327, 436 Alstermark, M., 112 Amador, M., 441 Amerman, J., 462, 473 Anderson, M., 155, 277 Anderson, S., 370, 403 Archangeli, D., 21, 33 Aurbach, J., 359
Borowsky, T., 290, 301, 318 Braida, L., 471 Brevad-Jensen, A.-C, 112 Browman, C , 11, 12-14, 97, 321, 341, 373, 395, 398, 406, 407, 423, 428, 431, 445, 446, 447, 479 Brown, G., 359, 361, 362, 369, 370 Brown Jr., W., 266, 448 Bruce, G. 7, 38, 97, 109, 111, 112 Briicke, E., 437 Brugmann, K., 258 Buckley, E., 271 Butcher, A., 483
Baer, T., 205, 400 Bannert, R., 112 Barker, M., 326 Barry, M., 359, 361, 366, 370 Barry, W., 483 Beckman, M., 4, 6, 7, 9-10, 31, 33, 42, 55, 104, 150, 156, 176, 181, 182, 216, 278 Beddor, P., 268, 441 Bell, A., 293, 294, 300, 326, 327 Bell, A. M., 437 Bell, H., 438 Bell-Berti, F., 445, 483 Bender, B., 326 Benguerel, A.-P., 410, 448, 453 Bhatia, T., 410 Bickley, C , 418 Bird, C , 218, 235, 236 Bladon, R. A. W., 266, 466 Bloomfield, L., 325 Blumstein, S., 402, 483 Bolinger, D., 37, 38, 150 Booij, G., 180 Borden, G., 357 Borgstrom, C , 327
Card, E., 452-3 Carlson, R., 277 Carney, P., 459-60, 483 Carrell, T., 483 Carter, D., 213 Catford, I., 326, 362 Chan, J., 271 Chao, Y.-R., 438 Chen, M., 180, 479 Chiba, T., 439 Chien, E., 271 Chomsky, N., 215, 259, 270, 277, 284, 285-6, 403, 441 Christdas, P., 286, 298, 322 Church, K., 277 Clements, G. N., 4, 7, 11-12, 13, 14, 33, 36, 38, 39, 44, 54, 59, 60, 66, 69, 70, 218, 219, 250, 254, 255, 292, 293, 294, 298, 302, 309, 320, 322, 326, 329, 346, 350, 351, 357, 382, 383, 384, 385, 392, 407, 445 Cobler, M., 17, 19, 21, 33, 65, 69 Cohen, J., 104, 105, 414 Cohen, P., 104, 414 Cohen, P., 366 491
INDEX OF NAMES Coker, C , 379 Collier, R., 97, 408 Cooper, C , 271 Cooper, F., 342, 370 Cooper, W., 4, 37, 46, 50, 152, 154, 174, 176, 191, 195, 201, 203, 204, 206, 209 Coppieters, R., 326 Costa, D., 271 Cowan, H., 453 Creider, C , 68 Crick, F., 280 Cutler, A., 10, 210, 211, 213 Dalby, J., 359, 369 Daniloff, R., 453, 462, 473 Davidsen-Neilsen, N., 317, 380 Delattre, P., 440 Dell, F., 293, 309, 326 Devine, A., 286, 328 Dewey, G., 301 Dixit, P., 448 Dommeln, W. van, 133 Dolcourt, A., 271 Donegan, P., 484 Dorman, M., 261 Dressier, W., 370 Durand, M., 268 Duranti, A., 297 Durlach, N., 471 Eady, S., 176, 206 Edwards, J., 9-10 Elmedlaoui, M., 293, 326 Emeneau, M., 326 Engstrand, O., 112 Eriksson, Y., 112 Eukel, B., 147 Ewan, W., 128, 266 Fant, G., 266, 277, 286, 435, 439 Filip, H., 271 Fillmore, C , 483 Flanagan, J., 267, 281 Flege, J., 426 Fletcher, H., 326 Foley, J., 286, 290 Ford, K., 407 Fowler, C , 9, 15, 152, 202, 206, 342, 352, 353, 406, 455, 480, 482, 483, 484, 486 Fromkin, V., 438 Fujimura, O., 12, 13, 216, 261, 290, 303, 321, 326, 327, 348, 352, 365, 377, 379, 380, 398, 445 Fujisaki, K , 37, 38, 39, 43, 54 Fukui, N., 447
Gamkrelidze, T., *38 Grading, E., 38-9, 137 Games, S., 410 Gay, T., 357, 370, 408, 463 Gee,J., 155,201 Gelfer, C , 205 Ghazeli, S., 452-3 Gimson, A., 362, 369 Gobi, C , 277 Goldsmith, J., 270, 346, 382, 407, 445 Goldstein, L., 11, 12-14, 97, 266, 268, 321, 341, 373, 395, 398, 406, 407, 423, 425, 428, 431, 441, 445, 446, 447, 479 Grammont, M., 284 Granstrom, B., 277 Greenberg, J., 283, 296, 309, 328, 438 Grimm, J., 402, 437 Grosjean, F., 155, 201 Guy, G., 366, 368 Hale, K., 180, 388, 389 Halle, M., 215, 258, 259, 270, 277, 284, 285-6, 290, 293, 346, 387, 389, 403, 435, 436, 441 Hankamer, J., 286, 290, 292 Hanson, R., 463 Hardcastle, W., 366 Harris, J., 301, 317, 318 Harris, K., 205, 445, 483 t'Hart, J., 50, 97 Hattori, S., 182 Hawkins, S., 480 Haynes, B., 18, 180, 301, 317, 322, 352, 385, 407, 480 Heffner, R.-M. S., 326 Henke, W., 406 Hermann, E., 327 Hertz, S., 10-11, 14, 215, 216, 255 Hirose, H., 370, 408, 410, 426, 428, 446, 447, 448 Hirose, K., 37, 38 Hirschberg, J., 38, 154, 169 Hockett, C , 477 Hombert, J.-M., 128, 266 Honda, K., 370 Hooper, J., 286, 290 House, A., 261 Householder, F., 261 Huang, C.-T. J., 36, 44, 59 Huffman, F., 326 Huffman, M., 467 Huggins, A., 152, 206 Hulst, H. van der, 318 Hunnicutt, S., 277, 454 Hutchison, J., 326 Hutters, B., 428, 429 Hyman, L., 20, 21, 33, 38, 63, 196, 297 492
Inkelas, S., 6, 17, 19, 21, 22, 31, 33, 65, 69, 70 Isaacs, E., 483 Ishida, H., 348 Ishizaka, K., 267, 281
Ladd, D. R., 6-7, 35, 36, 37, 41, 46, 51, 55, 140, 145, 147, 150 Ladefoged, P., 13, 293, 294, 327, 346, 400, 401, 402, 403, 418 Laeufer, C , 302 Laughren, M., 69 Leben, W., 6, 17, 19, 22, 26, 33, 65, 69, 385, 388, 407 Lehiste, I., 191, 195, 201, 206, 210, 211, 213,
Jaeger, J., 289, 326 Jakobson, R., 286, 325, 435, 436 Jasanoff, J., 326 Javkin, R , 268, 438 Jesperson, O., 284, 285, 289, 290, 325, 326, 437 Johnson, C , 51 Jonasson, J., 266 Jones, D., 59 Juret, A., 392
256
Lejeune, M., 327 Lekach, A., 292 Leung, E., 66 Lewis, J., 366 Liberman, A., 261, 342 Liberman, M., 5, 36, 37, 38, 44, 47, 48, 49, 50, 150, 155, 170, 196, 206, 277 Lieberman, P., 4 Liljencrantz, J., 304 Lin, Q,, 277 Lindblom, B., 266, 267, 290, 304, 352, 370, 463, 466, 481 Lisker, L., 147, 370, 408 Lofqvist, A., 410, 426, 428, 445, 446, 447, 448 Lornetz, J., 268, 440, 441 Lotts, D., 176 Lovins, J., 290, 321, 326, 327, 352, 380 Lowenstamm, J., 309, 317, 326 Lubker, J., 463, 481 Lyberg, B., 112, 414
Kadin, J., 216, 255 Kagaya, R., 410, 448 Kahn, D., 299, 302, 327, 359 Kajiyama, M., 439 Kakita, Y., 379 Kante, M., 236 Karplus, K., 216, 255 Katamba, F., 196 Kawasaki, H., 265, 268, 290, 400 Kay, B., 373 Keating, P., 15, 205, 256, 266, 283, 290, 351, 363, 398, 406, 455, 461, 463, 465, 467, 476, 477, 479, 481, 484 Kelso, J. A. S., 341, 342, 343, 348, 373, 412, 482, 486 Kenstowicz, M., 385, 407 Kent, R., 260, 261, 359, 377, 459-60 Key, T., 258 Keyser, S. J., 218, 250, 292, 298, 302, 309, 313, 320, 326, 350, 351, 357, 384, 385, 400, 407, 445 Kingston, J., 14-5, 435, 446, 448 Kiparsky, P., 286, 290, 298, 325, 326, 385 Kiritani, S., 348 Klatt, D., 105, 152, 154, 174, 191, 195, 201, 203, 206, 277, 406, 454 Klouda, G., 176 Knoll, R., 50, 206 Kohler, K., 8-9, 115, 121, 129, 132, 133, 134, 361, 366, 441 Kohno, T., 182, 196 Koopmans-van Beinum, 467 Koshal, S., 326, 328 Krakow, R., 268, 441, 467, 474, 483, 484 Kubozono, H., 55 Kuehn, D., 362, 370 Kuenzel, H., 483 Kupin, J., 213
Macchi, M., 261, 327, 351, 379 Mackay, D., 478 MacNeilage, P., 454, 463 Maddieson, I., 346, 407 McCarthy, J., 351, 353, 384, 395, 396, 407 McCawley, J., 181, 182, 196 Maeda, S., 379 Malecot, A., 261 Manuel, S., 467, 474, 483, 484 Mattingly, I., 74, 270, 368, 369 Maxwell, J., 277 Meddiss, R., 33 Meillet, A., 392 Mermelstein, P., 400 Mertus, J., 483 Michaelis, L., 271 Michelson, K., 326 Miller, J., 348 Millikin, S., 290, 312, 329 Miner, K., 388, 389 Miyara, S., 182 Mohanon, K., 384 Moll, K., 359, 362, 370, 453, 462, 473, 483 Monsell, S., 50, 206 Morolong, M., 297
Labov, W., 366
493
INDEX OF NAMES Mountford, K., 234, 244, 245 Mueller, P., 176 Miiller, E., 266 Munhall, K., 370, 373, 414, 447, 480 Murray, R., 271, 287, 302, 319 Nakatani, L., 105 Nater, H., 326 Nespor, M., 18-9, 180 Neu, H., 359, 366 Newman, P., 17, 20, 22 Newman, R., 17, 20 Newton, I., 267, 271 Ni Chasaide, A., 410, 418 Nigro, G., 261 Niimi, S., 370 Nolan, F., 466 Nordstrand, L., 112 Norske, R., 309 Occam, William of, 267 Ohala, J., 11, 14, 128, 147, 261, 265, 266, 267, 268, 269, 281, 290, 370, 414, 429, 437, 438, 439, 440, 441, 480, 481, 483 Ohde, R., 463 Ohman, S., 112, 265, 352, 363, 406, 454, 455, 483 Oiler, K., 152, 203 Oshika, B., 359 Osthoff, R , 258 Ostry, D., 370, 373 Paccia-Cooper, J., 46, 152, 154, 174, 191, 195, 201, 203, 204, 206, 209 Panini, 436 Parush, A., 370 Pattee, H., 479 Perkell, J., 483, 486 Peterson, G., 256 Petursson, M., 410, 422, 447 Phelps, J., 268 Picheny, M., 471 Pierrehumbert, J., 4, 6, 7, 11, 31, 33, 35, 37, 38, 39, 42, 43, 45, 47, 48, 49, 50, 54, 55, 59, 68, 69, 72, 74, 104, 145, 150, 154, 155, 156, 176, 181, 182, 277, 278, 454, 455, 477 Pinkerton, S., 438 de Pijper, J., 97, 147 Pike, K., 201, 293, 294, 326 Pitman, I., 437 Plaatje, S., 59 Pols, L., 261, 467 Poser, W., 4, 55, 56, 67, 181, 182, 278 Price, P., 291 Prince, A., 44, 150, 196, 206, 321, 384, 385, 389, 395, 407
Pulgram, E., 300 Pulleyblank, D., 21, 21-2, 33 Rakerd, B., 202 Raphael, L., 261 Rask, R., 437 Recasens, D., 484 Reeds, J., 261 Reilly, W., 327 Reinholt-Petersen, N., 147 Repp, B., 261 Rialland, A., 219, 234, 235, 236, 245, 294 Riordan, C , 438 Roach, P., 366 Roberts, A., 301 Robins, C , 366 Robins, R., 437 Rotenberg, J., 196 Rothenberg, M., 266 Rubin, P., 341, 400 Ryle, G., 477 Saddy, D., 389 Sag, I., 170 Sagey, E., 11, 383, 384, 392, 395, 407 Saka, M., 326 Saltzman, E., 341, 342, 343, 372, 373, 480, 482, 485 Sangare, M., 219, 234, 235, 236, 245 Sapir, E., 2, 325 de Saussure, F., 284 Sawashima, M., 370, 448 Schachter, P., 438 Schaffer, J., 105 Scheib, M., 261 Schein, B., 322, 384, 407 Scherer, K., 140 Schourup, L., 437 Schouten, M., 261, 467 Schuchardt, H., 390, 395 Schvey, M., 370 Schwinger, J., 277 Schwyzer, E., 327 Scott, D., 210 Selkirk, E., 9, 18-19, 101, 153, 155, 157, 174, 179, 180, 191, 195, 196, 206, 286, 290, 292, 297, 300, 312, 318, 328, 351 Sennett, W., 202 Severeid, L., 459-60 Sezer, E., 407 Shadle, C , 399, 426 Shankweiler, D., 342 Shapiro, R., 271 Sharf, D., 463 Shattuck-Hufnagel, S., 406 Shevelov, G., 392 494
Van Valin, R., 289, 326 Varma, S., 327 Vaissière, J., 377, 458, 460 Vatikiotis-Bateson, E., 373, 482, 486 Vendryes, J., 387 Vennemann, T., 287, 302, 319, 346 Vergnaud, J.-R., 290, 387, 389 Vogel, I., 18-19, 180
Shih, C., 180 Shipley, K., 267 Shockey, L., 359, 369 Sievers, E., 284, 285, 289, 300, 325, 326 Silverman, K., 7, 8, 74, 99, 104, 117, 140, 147, 149, 150 Smith, C., 341 Smith, M., 484 Snider, K., 33 Sorenson, J., 4, 37, 50 Stampe, D., 484 Steele, S., 73, 82, 96, 104, 145, 147 Stephens, L., 286, 328 Steriade, D., 13, 286, 289, 300, 312, 318, 322, 328, 385, 387, 389, 390, 407, 412 Sternberg, S., 50, 206 Stevens, K., 15, 37, 266, 298, 313, 400, 403, 426, 441 Streeter, L., 261 Studdert-Kennedy, M., 342 Sussman, H., 463
Wagner, H., 390, 396 Walley, A., 483 Walusimbi, L., 196 Wang, W. S.-Y., 261, 483 Ward, G., 169 Watson, J., 280 Weeks, R., 359 Weiher, E., 483 Weismer, G., 429 Westbury, J., 264, 271, 438, 448 Weymouth, R., 266, 268 White Eagle, J., 388, 389 Whitney, W., 284, 441 Wilkins, J., 437 Williams, C., 37 Williamson, K., 329 Winitz, H., 261 Wodak, R., 370 Wolf, O., 291 Wooters, C., 271 Wright, C., 50, 206 Wright, J., 268, 441
Tatham, M., 467 ten Kate, L., 437 Terada, M., 182 Thorsen, N., 50 Thrainsson, H., 346, 410, 412 Thurneysen, R., 258 Titze, I., 281 Trubetzkoy, N., 2, 325, 407, 436 Tucker, A., 68, 326 Tuller, B., 348, 412, 482, 486
Yip, M., 59 Yoshioka, H., 410, 426, 428, 446, 447
Uldall, E., 440 Umeda, N., 147 Ushijima, T., 408, 448
Zetterlund, S., 112 Zipf, G., 260 Zue, V., 359 Zwicky, A., 286, 297, 359
Vago, R., 328 Vaillant, A., 392 Van Coetsem, F., 294
Index of subjects
accents see phrase accents, pitch accents accentual phrases see pitch accents, prosodic structure, phrase types in acoustic theory of speech production, 277 computational efficiency of, 277 alphabetic models of sound patterns, 2 ambiguity prosodic resolution of, 211 speakers' awareness of, 211, 213 ambisyllabicity, 357 approximant see sonority scale, definition of, in terms of major class features articulatory binding see Binding Principle articulatory models, 379 see also Binding Principle, gestures articulatory trajectories see gestures, score for, window model of (co)articulation articulatory tiers see gestures, score for assimilation of manner, 383 assimilation of place anticipatory vs. perseverative, 11, 258-9, 271 putative causes of: ease of articulation, 260; perceptual, 11, 261-5; speech errors, 260 representation of, 259 reproduction in laboratory, 262-5 association (autosegmental) as temporal overlap, 11 Bambara tone, 10, 215, 234-5 downstep, 218, 235, 245 floating, 217, 218, 235, 236-40 lexical, 235, 236 phonetic realization, 244-6; baseline, 244-5; sentence duration on, 244-5; topline, 244-5 spreading, 217 Basic tone value (BTV) see fundamental frequency contours
binary features see features, types of Binding Principle, 14 acoustic-auditory basis of, 14-15, 408-9, 425-6, 430-1, 435, 439-41, 445-6 aerodynamic basis of, 408-9, 426, 428, 432, 435, 437-9, 445-6 alternatives to, 446-7, 448-9 counterexamples to, 14-15, 416-18, 424-9, 430: complementarity of closure and aspiration duration, 425, 428-9; fricatives, 425, 426, 430-1, 446; obstruent clusters, 425, 426-8, 431, 446-7; unaspirated stops, 425-6, 430-1, 447-8 definition of, 409, 435 extensions of, 435-41 functional motivation for, 14, 408-9, 424-5, 431 glottal-oral coordination, 14-15, 408-9, 424-30, 446: relational invariance, 412; see also correlational analysis manner of articulation in, 14-15, 407-9, 430, 432, 435-6, 445-7, 448-9 nasal assimilation to labio-velars, 441 predictions of, 412, 416-18, 424-9, 446-7 reinforcing acoustic effects of multiple oral constrictions, 439-41: English /r/, 440-1; labio-velars, 439-40 soft palate-oral coordination, 437-8, 441: devoicing, 437-8; frication, 437-8; obstruents, 437-8; perseverative nasalization, 437; vowel height and nasalization, 441 spontaneous nasalization, 441 variation in tightness of coordination: acoustic consequences, 448-9; inventory size, 441; robustness of acoustic properties, 441-2; states vs. gestures, 436 voicing in obstruents: closure duration, 438;
SUBJECT INDEX devoicing and place of articulation, 438; passive expansion of oral cavity, 438 see also Hindi stops, Icelandic pre-aspirates boundary strength, 46 boundary tones, 19, 35, 109, 156, 157 Boyle-Mariotte's Law, 267 casual speech, 12-13, 359-71 changes in gestural phasing or magnitude for, 12-13, 360-71 phonological rules for, 12-13, 359-60, 369-71 see also gestures, overlap of, hidden catathesis see downstep categorical distinctions, see gradience class nodes, 384 coarticulation, 15 blocking by contradictory specification, 462-3, 466 compared to assimilation, 483-4 feature spreading, 452 jaw position, 462-465 nasality, 458-60 overlapping activation of articulatory ensembles, 486 perception of, 483-4 phonetic vs. phonological, 452-3: Arabic emphasis 452-3 propensity for, 466 resistance to, 466 transitional, 452, 453 universal vs. language specific, 467, 474, 484 see also jaw movement, window model of (co)articulation complexity hierarchy, 307-8, 320-1, 323, 328-9 see also French CGV demisyllables, minimal distance constraints, Syllable Contact Law compensatory lengthening and shortening see linguistic (phonological) vs. physical regularities complexity metric see core syllabification principle, Dispersion Principle, syllable structure compression limits on duration, 205 contour segments, 384—5 see also gestures, as theory of direct timing coordination see Binding Principle, gestures, organization of phonological structures core phonology see Core Syllabification Principle, syllable structure, surface vs. underlying Core Syllabification Principle, 299, 301, 313, 316, 317 complexity of syllables, 302-3, 305-7 markedness of syllables, 302
maximal onset principle, 300, 314, 316-17, 324, 326 onsets before codas, 300 optimal syllable, 302 typology, 314, 320-1, 324 universal vs. language-specific, 302, 327 see also Dispersion Principle, syllable structure, surface vs. underlying core syllable typology see complexity hierarchy, Core Syllabification Principle, syllable structure correlational analysis part-whole, 414-21 see also Binding Principle cv tier see temporal structure declination see fundamental contours degree of stricture see sonority scale Delta, 10 compilation into C, 217, 255 dictionaries: action, 241, 242, 256; set, 241, 243 expressions: time, 230-1 fences, 233-4, 249 interactive development environment (debugger), 217, 253-4 macro processor, 217 multi-level data structure, 216, 219-20: targets in, 247-8; transitions in, 247-8 operators: at, 231 pointer variables, 225-6 requirements, 217 representation of [h] in, 250-2, 256 rules: action, 224; delete, 240; forall, 231, 233, 234, 240, 241, 242, 243, 246, 248, 249; test, 244, 253 statements: do, 237, 241, 243, 246; execcmd, 254; duplicate, 243, 249, 255; exit, 239; generate, 245; if-else, 237, 246, 248; insert, 225, 226, 246; mark, 253; repeat, 239, 241 streams, 218, 220: default, 229; definitions, 221; time, 229-31 sync marks, 219, 225-6: adjacent, 226; contexts, 228; merging, 226, 236-9, 241; ordering, 226-8; projection of, 225, 255 tokens, 218: attributes, 221; features, 221; fields, 220-3, 255; GAP, 226, 255; synchronization, 223-6; values, 220-3; variables, 233 targets and interpolations, 220, 247 demisyllables, 12, 303, 323-4, 327, 378 markedness of, 310-1 see also complexity hierarchy, Dispersion Principle, length hierarchy, sonority,
SUBJECT INDEX sonority scale, sonority sequencing, syllable structure, constituents of Derived Tone Value (DTV) see Fundamental Frequency Contours designated category parameter, 180 Designated Terminal Element (DTE), 44 diphone see demisyllables diphthongs, 327-8 dispersion of features see sonority cycle Dispersion Principle, 301, 302-6 calculation of dispersion, 304 maximal sonority change in onsets vs. minimal change in codas, 12, 303-5, 315, 316 see also sonority, sonority cycle, sonority scale, sonority sequencing, syllable structure domain-specific explanations, 11, 277, 290 see also phonetic explanations of sound patterns, domain independence of Dorsey's Law autosegmental account of, 393-4 complex onsets as targets in, 389, 390, 395 direction, 392, 396 gestural account of, 390-3 magnitude, 393 romance, 13: Early Latin, 392; Late Latin, 390, 393, 396; Sardinian, 390, 396 Slavic, 13: Eastern Slavic, 392; non-Eastern Slavic 392 syllabification, 391: well-formedness 393, 396 target positions: syllable-internal, 388-91; syllable-peripheral, 391-2 Winnebago, 13, 388-9: metrical restructuring by, 389; ordering with respect to syllabification and stress assignment, 389; morphological structure 389, 395 see also gestures downdrift see downstep, Fundamental Frequency Contours, downtrends downstep, 6, 17, 21, 181, 218, 245 African languages, 44—5 categorical nature of, 42 compression of pitch range, 5 factor 54, 68 independence from tonal contrasts, 42, 43 intermediate tone levels, 4, 5 redundancy of phonological representation of, 7 register shift, 4, 5, 17, 21, 22-4, 26-7, 30, 33, 40-1, 61, 62, 63-4 sequential, 54 suspension of, 64, 69 terminology for, 55 tonal prominence, 6-7, 35 see also Bambara tone, English intonation,
Fundamental Frequency Contours, Hausa intonation, Japanese prosody, Llogoori downstep downtrend see downstep, Fundamental Frequency Contours, declination dynamics see task dynamics end parameter, 180, 196 English formant timing, 10, 215, 217, 246-53 durations, 247-8 targets, 247 transitions, 247-8: explicit, 247-8; implicit, 247-8 English intonation downstep, 6-7, 35, 40-4; contrasts with African languages, 44-5; phonological representation of, 42-4 features in, 41 phonological representation of, 35 see also downstep, Fundamental Frequency Contour, pitch accents event triggers, 378 see also gestures, score for, phase relations in extrametricality, 165-7, 177, 352 extragrammatical systematicity, 3, 10, 15, 205-6, 209, 210, 477, 478, 482 see also final lengthening, production- vs. perception-based accounts of, linguistic (phonological) vs. physical regularities, phonetic implementation extraprosodicity, 301 extrasyllabic consonants, 289-90, 315, 328 epenthesis triggers, 290 restricted to edges of syllabification domain, 290, 328 features, types of binary, 395 privative (single-valued), 386, 387-8, 395 scalar (multi-valued), 32, 286, 296-8 final dental-alveolar, low of, 311 see also sequential markedness principle final lengthening, 9-10, 74, 96, 101, 110, 142, 143, 152, 157-158, 181, 192, 193, 201,
202-3, 206
cue to phrase boundary, 210 determined by focus, 174-6, 213 determined by intonational phrasing, 158-64, 192, 195, 204-5, 208, 212 phonologically determined, 154, 176 production- vs. perception-based accounts of, 10, 209-11 syntactically determined, 154, 174, 194-5, 204-5, 206 word-final, 82, 162-8, 176, 194, 203, 204, 208, 212
see also pitch accents, peak alignment of, prosodic lengthening, prosodic structure final raising see Hausa intonation, raising of pitch floating tones, 22, 217, 218, 235, 236-40 see also Bambara tone, Hausa intonation focus interaction with final lengthening, 174-6, 213 hierarchical representation of, 174-6 see also prosodic structure formal explanation of sound patterns, 259-60, 267, 290 computational efficiency of, 277-8 level of explanation, 278-9 value of explicit formalization, 278 see also phonetic explanation of sound patterns French CGV demisyllables, 309-10 see also complexity hierarchy Fundamental Frequency Contours Bambara, 244-6 declination, 4, 54-5, 66-7, 205, 244, 256 downtrend, 3-5: hybrid account of, 4-5; phonology of, 4-5; phonetics of, 4-5; syntactic constituency, 4 English: phonological representation of, 35 functions of, 115, 120, 141 hierarchical phonological representation of, 36 initial value: sentence length, 50; syntactic structure, 51-3, 55-6 level vs. shape of, 39 linearity of phonological representation of, 36 local adjustments, 68 micro- vs. macro-scopic (= micro- vs. macro-prosody), 115-16, 117, 118, 128: automatic recognition of, 140; microprosodic contour shape, 136-7, 149; interaction with duration, 129-31, 137, 139-40; language-specific differences in micro-prosody, 136; perceptual integration/separation of micro- and macro-prosody, 8, 136-7, 139, 147-50; perceptual salience of microprosody, 136, 146-50; speaker control of micro-prosody, 147; synthesis of, 117, 140-1 models for realization of, 38-40, 53-4, 65-9 tone values: basic, 68, 69; derived, 68 see also downstep, lowering of fundamental frequency, prosodic structure, resetting geminates, 385 integrity of, 407 see also linked sequences, partially assimilated clusters German stress see pitch accents
gestural overlap see pitch accents, peak alignment of gestures adequacy of: descriptive, 378, 400-3; explanatory, 378, 400-3 alternative to autosegmental representations: length distinctions, 384—8; overlap, 383-8; temporal structure, 384-8; see also length, temporal structure articulatory tiers, 346, 350-9: consonantal and vocalic, 352-8, 371; contiguity on, 354; coordination between, 14; coordination within, 14; functional, 350, 351-9, 371; oral projection, 351, 353; autosegmental tiers, 13, 353; linear feature matrices, 353; rhythmic, 350-1 correspondence to articulates, 383 correspondence to distinctive features, 383 deficiencies of: exclusion of acoustic-auditory aspects of phonological patterns, 400-2; failure to pick out natural classes, 401-2; insufficient abstractness, 400-1; overly restrictive phonological processes, 400-1 direct timing theory, 383-8: Dorsey's Law, 388-94; overlap, 386-7, 388-94; representation of contour segments, 386-7; simultaneity, 386-7; see also length, temporal structure evaluation of, 13, 400-3 general properties of, 342, 346 hidden, 360, 363-6, 371: see also Icelandic pre-aspirates, components, glottal-oral coordination lumping of place and manner of articulation in, 383, 402 magnitude change, 366, 369, 371: deletion, 366, 369 overlap of, 360-3, 363-8, 371: across tiers, 360-2; assimilation, 361-2, 369; coronals in, 362, 368, 369; deletion, 360; syllable structure, 368-9, 371; within tiers, 362-3; see also Icelandic pre-aspirates, components, glottal-oral coordination phonological constraints on: cohesion, 479; communication, 479; separability, 479; see also linguistic (phonological) vs. physical regularities representation of subphonemic timing, 382 Scores for, 343-6: coordination between gestures by, 12, 348-50; functional tiers in, 350, 351-9; phase relations in, 12, 348, 354-9, 368, 371, 377; rhythmic tiers in, 350-1, 373 tract variables, 343-6, 350: additional, 398; constriction degree, 343, 351, 372; constriction location, 343, 346, 372;
dependencies between, 400; features and natural classes, 401; mapping phonological features onto, 403; speaker variation in, 398-9 see also casual speech, Dorsey's Law, jaw movement, phonetic implementation, task dynamics global raising, see Hausa intonation, raising of pitch gradience coarticulation, 453 contrast between gradient and categorical distinctions, 6, 32, 36-8, 39-41, 53 downstep, 42 prominence, 36-8, 39-41, 53, 60, 61-2, 68 underspecification, 461, 465 see also downstep, English intonation, phonetic implementation Hausa intonation downstep, 6, 17, 21, 22-4, 26-7, 33, 63-4: suspension of, 64 emphasis, 19, 25-7, 31, 70: relationship to downstep, 26-7; relationship to phrase boundaries, 26 floating tones, 32 ideophones, 27-30, 31, 64-5, 70 raising of pitch: final raising, 65; global raising, 18, 23, 30, 64; high raising, 18; ideophone raising, 28, 64-5; low raising, 18; question raising, 6, 23-4, 28, 31, 64-5 tonal nodes, 20-1, 63 tone features, 19, 31-2: scalar, 32 tone tiers: primary, 6, 19-21, 63; register, 6, 19-21, 22-4, 30, 58, 63, 65, 69, 70 see also downstep, register, Zulu questions hierarchies discrete vs. continuous elements, 297 independence of defining properties, 297 see also complexity hierarchy, length hierarchy, prosodic structure, sonority hierarchy high raising, see Hausa intonation, raising of pitch Highest Terminal Element (HTE), 44, 47, 50, 59-60, 62 Hindi stops glottal-oral coordination in, 410, 413, 416-18, 430 see also Binding Principle icebergs, 365, 380 Icelandic pre-aspirates, 14 components of pre-aspiration, 418-23: differential variability, 420-2; perceptual value, 418, 423
dialect variation in, 411 glottal-oral coordination in, 409-24, 430: sliding, 423; see also gestures, hidden, overlap phonology of, 410-12, 430: contrast with post-aspirates, 411-12; relation to clusters, 411-12; relation to quantity contrasts, 411 rarity, compared to post-aspirates, 410 see also Binding Principle ideophone raising see Hausa intonation, raising of pitch inalterability, 407 intonation American vs. British schools of, 72 tone vs. non-tone languages, 62, 63 see also Fundamental Frequency Contours, pitch accents intonational phrase see prosodic structure, phrase types in interpolation see targets (articulatory) invariance, 402-3 see also pitch accents, invariant relations between, peak alignment of isochrony see stress timing Japanese prosody accented vs. unaccented words, 181 declination in, 55 downstep in, 278 final raising in, 6 hierarchy of constituents in, 9, 181-2, 199 relationship of syntactic structure to, 55-6 jaw movement coarticulation from vowel to consonant, 462-5 coordination with lip movement, 481-2: equifinality, 482 see also motor equivalence; phonetic rule, 481-2 coordination with tongue movement, 399 phonological specification, 398-9, 462 speaker variation in, 398-9 tract variable see gestures see also coarticulation, window model of (co)articulation key raising, 17, 20 see also Hausa intonation laboratory phonology see methods, phonetic implementation length disparity between phonetic and phonological, 395 distribution of tonal contours, 387-8 phonological distinctions, 384—5, 395
see also contour segments, geminates, partially assimilated clusters representations of, 384-6 see also gestures, temporal structure length hierarchy, 307-8, 323 length restrictions on demisyllables, 310 line-crossing constraint, 395, 407 linguistic (phonological) vs. physical regularities, 477 assimilation vs. coarticulation, 483-4 compensatory lengthening and shortening, 480-1 controlled vs. automatic patterns in utterances, 477, 482 duality of segments, 478-9 nontranslatability between, 478 phonetic origin of phonological patterns, 480 plans for utterances vs. their execution, 478, 479 phonological patterns as abstract constraints on articulatory dynamics, 479 universal vs. language-specific, 479, 482 vowel shortening before voiceless consonants, 479
see also extragrammatical systematicity, misperception by listeners, phonetic implementation linguistic signs, nature of, 2 linked sequences, 321-2, 329 see also geminates, partially assimilated clusters Llogoori downstep, 66-7 low raising see Hausa intonation, raising of pitch lowering of fundamental frequency final, 4, 47, 54, 154 initial, 181 see also Fundamental Frequency Contours, resetting macro-prosody see Fundamental Frequency Contours major class features, status in feature geometry, 322 see also sonority scale major phrase see Japanese prosody, hierarchy of constituents in mapping conventions, 407 markedness of segments, 296 marking conventions, 259, 298 mass-spring models, 12, 346-8, 372, 373 critical damping in, 346-8, 377 equilibrium of, 347-8, 373: peak displacement, 348 mass in, 378
stiffness in, 347, 351, 354, 357-8, 373, 378: stress, 351, 373; velocity, 378 Maximal Onset Principle see Core Syllabification Principle, Dispersion Principle, syllable structure maximal projection, 180-1, 182, 193 Maxwell's equations, 277 methods distinctions between phonetic and phonological, 1-3: criteria for choosing between, 4 hybrid, 3, 4-5 metrical representations foot, in relation to word boundary, 9 grids, 9, 44, 153, 155-8, 196, 199: silent beats in, 101, 153, 196, 199 hierarchy of, 35 linearization of, 43-4 nonlocal dependencies in, 45-7, 60-1 power of, 45, 59, 67 redundancy of, 60, 61 register, 41-4, 58, 62: nesting of, 47-50; scaling of high peaks, 50-3 trees, 43-4, 153, 156: feet, 153; tonal feet, 44 see also downstep, Fundamental Frequency Contours, prosodic structure micro-prosody see Fundamental Frequency Contours minimal distance constraint, 312, 313, 314, 317-18, 328-9 see also complexity hierarchy, sonority sequencing minor phrase see Japanese prosody, hierarchy of constituents in misperception by listeners insensitivity to coarticulation, 483-4 origin of phonological patterns, 481, 483 sound change, 266 Mora, 196 motor equivalence, 379, 482, 485 see also jaw movement, coordination with lip movement multiple regression see regression neutralization of tonal contrasts in Ancient Greek and Lithuanian, 387-8 see also gestures, length, temporal structure nonconcatenative morphology, 353 notations, defectiveness of, 259 improvements on, 259-60 Obligatory Contour Principle (OCP), 26, 28, 407 onsets see syllable structure, constituents of
organization of phonological structure between tiers, 13, 14-15, 406-7 determined by phoneme inventory, 481, 482 features, 407, 436-7, 445 formal principles of, 407 linguistic constraints on, 477 physical constraints on, 341, 477 segments, 406-7, 436-7, 445: simultaneity constraint, 406-7, 445 within tiers, 14, 15 see also Binding Principle, gestures, units of phonological representation
see also extragrammatical systematicity, gestures, linguistic (phonological) vs. physical regularities, psychological reality, evidence for, task dynamics phonetic syncretism, 8 phonetic underspecification see window model of (co)articulation phonological boundary coincidence, 179-80 phonological component of a grammar see phonetic implementation phonological hierarchy see prosodic structure phonological mediation see pitch accents, peak alignment of phonological rule domains see prosodic partially assimilated clusters, 385 structure, hierarchy of performance see phonetic implementation, phonological phrases see prosodic structure, psychological reality, evidence for peripherality condition, 301 phrase types in phonological word see prosodic structure, word see also extrametricality, extraprosodicity, types in extrasyllabic consonants phonological underspecification see window phase see gestures, score for phonetic affix, see syllable, affix model of (co)articulation phonetic component of a grammar see phonetic phrase accents, 31, 35, 156, 176-7 pitch accents implementation phonetic explanation of sound patterns, 266-70, downstep as relation between, 42-4 English pitch accents, 35 435-6, 437-42 German prefix and stem stress, 121, 126-7, Acoustic/auditory constraints, 268, 439-41 139, 141, 145 production constraints, 268, 437-9 contrast with nonlinear (autosegmental) invariant relations between, 47-50 Japanese pitch accents, 181 models, 11, 269-70, 442: computational nuclear, prenuclear, postnuclear, 72, 74, 96, explicitness, 11, 269-70, 281-2; 103, 105, 156, 166, 167, 169: relative description and taxonomy, 269-70; prominence of, 39-40, 55, 60 naturalness, 269-70, 441; pedagogical peak alignment of (timing of), 7, 74, 107, 108, effectiveness, 269-70; psychological reality, 109: changes in, 120-1; final lengthening, 269-70; representation of sound change, 74; inter-accent distance, 99-101, 112; 269-70 gestural overlap in, 75-7, 97-101, 104, Contrast with formal explanations, 267, 411 112; invariance in, 7, 75-7, 80, 96-7, 104, deductive vs. inductive generalizations, 280 109, 111; peak delay in, 80, 86-7, 97, 104; domain independence of, 267, 276, 280, 281, peak proportion, 104, 86-7, 97, 111; 291, 441 phonological mediation of, 7, 75-7, 101-2, explanation vs. description, 280, 441 interaction with cognitive factors, 278 104, 112; prosodic structure, 73-5, 80, 82, level of explanation, 278-9 84-5, 87-88, 90-5, 96-7, 104; rate, 74, 82, power of primitives in, 11, 267, 271 87, 96-7, 108: semantic value of 115, reductionist character of, 11, 276, 280, 281, 441 136-7, 139, 145; sonority profile, 7, 75-7, value of explicit formalization, 278 102-3, 104, 113; stress clash, 7, 80, 82, 87, see also formal explanation of sound patterns 90, 95, 96, 99, 101, 104, 105; stress-group phonetic implementation boundaries, 109, 141; tonal repulsion in, 76-7, 97-101, 104, 112; word boundary, phonetic rules, 451, 476: failure of universal 80, 82, 87, 90, 96, 99 phonetics, 476; interaction between, 6; need for, 479, 481-2, 486-7; phonological phrases defined by, 164-8, 170, 173, 192, analogues of, 476, 480 193, 212 relation to (division of labor with) relative prominence of, 36-7 phonological representation 1, 2-3, 10-11, representation in metrical grid, 157 representation in metrical tree, 43-4 13-14, 17, 22, 58, 59, 63-4, 73, 101-3, Swedish word accents, 7—8, 109 107, 108, 141, 156-7, 196, 208-9, 212, Swedish focal accents, 109 217, 244-6, 283, 400-4, 451-3, 477-81 502
SUBJECT INDEX tonal composition of, 8, 107, 145: languagespecific differences in, 8 tonal crowding, 109 see also gradience pitch range constant, 38 definition, 59 default highest level, 55 speaker baseline, 53 variation in, 36-7; in variance of relations within, 47-50 see also gradience, register place perception asymmetries between pre- and post-vocalic consonants, 261-5: crosslinguistic differences in, 262; greater robustness of stop burst over formant transitions, 261, 265; mediation by linguistic experience, 264-5 concentration of cues to, 472 see also assimilation of place, putative causes of pre-boundary lengthening see final lengthening, prosodic lengthening precision of articulations see window model of (co)articulations pre-pausal lengthening see final lengthening, prosodic lengthening primary tone tier see Hausa intonation privative features see features, types of projection, 407 prominence see downstep, gradience, pitch accents, prosodic structure, register prosodic lengthening, 90-95, 96, 97, 100-1, 101-2, 108, 110, 111, 158, 204-5 see also final lengthening, prosodic structure prosodic context see prosodic structure prosodic structure constituents of, 6-7, 153, 154-5, 192: edges, 9, 152, 158; heads, 9, 152 coordination of segments and melody by, 73 hierarchy of, 17, 18-9, 35, 36, 43-4, 156, 179-80, 191, 196, 201 onsets vs. rimes, 102, 105, 113 parameters of, 140 phrase types in, 9, 181, 380: accentual, 164-8, 170, 173, 192, 193; intonational, 17, 158-64, 192, 194 syntactic structure, 9, 17, 18-9, 33, 45, 51-3, 154-5, 180, 191, 192, 193, 194-5, 196, 201, 20^5, 209-10 representation of prominence in, 8, 9, 36, 156-7, 191, 192, 194 well-formedness conditions for, 180 word types in: function, 182; lexical, 180-1, 503
182, 193; phonological, 168, 171^, 181, 196, 213, 380; prosodic, 9, 193 see also downstep, final lengthening, metrical representations, syllable structure prosodic word see prosodic structure, word types in psychological reality degrees of: strong, 10, 208-9, 212; weak, 10, 208-9, 212 evidence for: perception, 10, 211-12; production, 10, 211-12 quantity see length question raising see Hausa intonation, raising of pitch raised see tone features raised peak, 41, 55 reference line, 36-7 register (pitch), 6-7, 36, 58 default initial, 54 definition, 59 determination of tonal targets within, 38-40 distinction between register and partition of pitch range, 21, 36 global changes in, 59 phonology of, 41-4, 59, 62 shifts in, 21, 22-4, 43-4: constraints on, 39; prominence variation by, 38-40, 48, 61; vs. local F o adjustments, 66, 68 subset of pitch range, 38, 53, 59 width, 54 see also English intonation, Hausa intonation, Llogoori intonation register tone tier see Hausa intonation regression independent variables: main effects, 80, 88; interactions, 93, 105 lines: fit, 88; data trends, 74, 82, 92 regression coefficients: partial, 81; semipartial, 105 proportion of variance accounted for, 81, 104, 105 proportion of variance unaccounted for, 88-90 Relative Height Projection Rule (RHPR), 44, 47, 53, 59-60, 62 Relative Prominence Projection Rule (RPPR), 44 resetting phrase-initial, 45-7 see also lowering of fundamental frequency Resolvability Principle, 309 resonance, 286 rimes see syllable structure, constituents of
rhythmic regularity estimation of by listeners, 210-11 final lengthening, 210-11
universal vs. language-specific, 287, 296, 323 see also demisyllables, minimal distance constraints, Sequential Markedness Principle, sonority, sonority cycle, sonority sequencing, Syllable Contact Law, syllable scalar features see features, types of structure sense unit, 195 Sequential Markedness Principles, 311-14, 315, sonority sequencing, 283-7, 314—16, 325 asymmetry between onset and coda, 301 323 between syllables, 329 coronals' status in, 311-14 coronals' status in, 311-14 see also demisyllables, minimal distance principle of, 285, 286, 287, 299: level of, constraints sonority, sonority scale, sonority sequencing, syllable structure 287; domain of, 287 shared features convention, 407 place of articulation in, 311-14 simple regression see regression universal vs. language-specific, 300 single valued features see features, types of see also demisyllables, minimal distance constraints, Sequential Markedness privative features Principle, sonority, sonority scale, sonority sonorant see sonority scale, definition of, in sequencing, Syllable Contact Law, syllable terms of major class features structure sonority, 11 decay in final vs. non-final codas, 300-1 speech errors, 211, 406 see also units of phonological representation laryngeal consonants, 322 multi-valued feature, 296-8: see also features, speech posture, 486 SRS types of peaks, 287 linear representations in, 215-16 perceptual salience, 297-8: of coronals, 312 targets in, 247, 256 plateaus and reversals of, 287-9, 326 transitions in, 247, 256 phonetic vs. phonological basis of, 290-2, units of analysis in, 215-16, 247 326, 403 stress rank, 303 phonetic realization of: duration, 142-4, 146; redundancy of, 292 unreduced vowels, 156 Stress clash, 206 see also demisyllables, dispersion principle, sonority cycle, sonority profile, sonority see also pitch accents, peak alignment of scale, sonority sequencing, syllable contact stress timing, 104, 109, 152 law, syllable structure foot, 9, 168, 170, 173, 193, 206 sonority cycle, 284, 298, 299, 323. 325 shortening due to, 9, 152, 153, 157, 201, feature dispersion in, 298 202-3, 204, 206, 208 see also sonority, sonority scale, sonority strict layer constraint, 180, 181, 196 sequencing, syllable contact law, syllable structure preservation, 298 structure see also core phonology, core syllabification sonority hierarchy see sonority, sonority scale principle sonority profile see pitch accents, peak subglottal air pressure in downtrends, 4 alignment of see also Fundamental Frequency Contours sonority scale, 11, 284, 314 surprise-redundancy contour, 170, 172-3 coronals' status in, 311-14 Swedish accents see pitch accents definition, 287: in terms of major class syllabic see sonority scale, definition of, in features, 11-12, 284, 286, 292-5, 298, 322, terms of major class features 403; primitive, 290-2, 322 syllabic segment, 293-4 distance along, 303 Complexity of, 328 liquids vs. nasals, 326 fineness of distinctions along, 295-6, 315-16 syllable laterals in 293 phonetic vs. phonological basis of, 12, 284, as organizing principle for segments, 403 290-2, 297-8, 322-3 as phonological, not phonetic, unit, 403 place of articulation in, 311-4 as sonority peak, 156 ranking of speech sounds by, 284 Syllable contact law, 286-7, 313, 314, 319-20, redundancy rules for, 295 324 syllabic vs. other major class features in, 295 see also complexity hierarchy, sonority, 504
sonority cycle, sonority scale, sonority sequencing, syllable structure syllable structure, 11 affix, 290, 352 asymmetry between onset and coda, 301 complexity of, 308, 311: multi-member demisyllables, 306-7; one-member demisyllables, 306 constituents of, 323-4 core, 290, 327 core typology, 314, 320-1, 324 crosslinguistic preferences for particular types of, 283 margin, 284, 285 Maximal Onset Principle, 300, 314, 316-17, 324, 326, 357 peak, 284, 285, 287 surface vs. underlying, 287-90, 298-9, 323 universal vs. language-specific, 290, 403-4 vowel-initial, 301 see also Core Syllabification Principle, demisyllables, Dispersion Principle, sonority, sonority cycle, sonority profile, sonority scale, sonority sequencing, Syllable Contact Law targets (articulatory), 15, 454 context-sensitive vs. invariant, 454, 461 interpolation between, 454, 455 spatial evaluation, 454 temporal evaluation, 455 under- and over-shoot, 454, 460 see also coarticulation, Delta, English formant timing, SRS, window model of (co)articulation task dynamics, 13, 342, 343, 346-8, 371, 377 compared to piece-wise approximations, 377 coordination: between gestures, 346, 348; within gestures, 346, 348 organization of articulators by, 485-6 overlapping activation of articulatory ensembles, 486 see also mass-spring models, phonetic implementation, window model of (co)articulation teleology in explanations for sound change, 266 see also assimilation of place, putative causes of, perceptual, misperception by listeners temporal structure CV tier as an indirect timing theory, 384, 385-8 direct vs. indirect representation, 383-8, 388-94 interaction with tonal structure, 111-12, 126-7, 139-40 overlap, 383, 386-8, 388-94: partial, 384
phonetic representation, 13 phonological representation, 13 precedence, 383 simultaneity, 383, 384, 386-7 see also gestures, length timing see temporal structure tonal foot, 21, 44, 68 hierarchical organization of, 44, 68 see also metrical representations, prosodic structure tonal grid, 38-40 tonal nodes, 20-1, 63 see also Hausa intonation tonal repulsion see pitch accents, peak alignment of tonal structure interaction with temporal structure, 111-12, 126-7, 139-40 tonal targets register, 38 scaling of, 35 tone features, 17, 19, 31-2, 41 raised, 21-2 upper, 21-2 see also English intonation, Hausa intonation tone tiers see Hausa intonation tongue twisters, 213-14 underspecification see window model of (co)articulation units of phonological representations, 382-3, 403-4 autosegmental, 407 discreteness of, 406-7 features, 407, 436-7, 445 relations between, 382, 383-4 segments, 406-7, 436-7, 445: idealization of utterance, 451 speech errors, 406 see also Binding Principle, gestures, organization of phonological structure upper see tone features upstep, 55, 60, 63 virtual pause structure, 195-6, 199 vocal tract variables see gestures, structure of vocalic, 294-5 vocoid see sonority scale, definition of, in terms of major class features vowel shortening before voiceless consonants see linguistic (phonological) vs. physical regularities window model of (co)articulation, 15 acoustic windows, 468-9 alternatives: undershoot, 460;
underspecification, 460, 461; see also output constraints (this entry) evaluation, 456, 468 for jaw position, 462-5 for velum position, 458-60: between nasal and vowel, 472; consonants vs. vowels, 459-60; interaction with perception of vowel height, 472-3 interpolation in, 15: constrained by continuity and minimal effort, 456; window width, 457-8, 463, 468 output constraints: canonical variants, 467-8; inventory size or distribution, 467, 474 phonetic underspecification, 461, 465 phonological underspecification, 465 selection of regions within windows for perceptual enhancement, 471, 747 targets in, 15: contextual range of spatiotemporal values, 455; extreme values, 456; extrinsic allophones, 456; not central tendency, 455, 457; not invariant, 461; physical values, 456; precision, 15, 472-3; reduction of variability by context, 461-2; stationarity of, 471
universal vs. language-specific, 467 width: acoustic constraints on, 473; articulatory constraints on, 473; compensation by other articulators, 485-6; consonants vs. vowels, 459-60; constriction location vs. vocal tract shape, 473; number of features implemented in a dimension, 465-6; phonological contrastiveness, 459-60, 465-6; propensity to coarticulation, 466-7; precision, 472-3; position in range, 466; underspecification, 461, 465, 472, 473; resistance to coarticulation, 466; variation between members of natural class, 15, 472 see also coarticulation, targets (articulatory), task dynamics word boundary see pitch accents, peak alignment of x-ray microbeam, 348-9 Zulu questions, suspension of downstep in, 69 see also downstep, Hausa intonation