Vocalize to Localize
Benjamins Current Topics. Special issues of established journals tend to circulate within the orbit of the subscribers of those journals. For the Benjamins Current Topics series a number of special issues have been selected containing salient topics of research, with the aim of widening the readership and giving this interesting material an additional lease of life in book format.
Volume 13
Vocalize to Localize
Edited by Christian Abry, Anne Vilain and Jean-Luc Schwartz
These materials were previously published in Interaction Studies 5:3 (2004) & 6:2 (2005), under the guidance of Editor-in-Chief Harold Gouzoules.
Vocalize to Localize
Edited by
Christian Abry
Anne Vilain
Jean-Luc Schwartz
John Benjamins Publishing Company Amsterdam / Philadelphia
The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
Library of Congress Cataloging-in-Publication Data

Vocalize to localize / edited by Christian Abry, Anne Vilain, Jean-Luc Schwartz.
p. cm. (Benjamins Current Topics, ISSN 1874-0081 ; v. 13)
Previously published in Interaction Studies 5:3 (2004) & 6:2 (2005).
Includes bibliographical references and index.
1. Oral communication. 2. Visual communication. I. Abry, Christian. II. Vilain, Anne. III. Schwartz, Jean-Luc. IV. Interaction studies.
P95.V63 2009
302.2'242--dc22          2009003060
ISBN 978 90 272 2243 5 (hb; alk. paper)
ISBN 978 90 272 8951 3 (eb)
© 2009 – John Benjamins B.V. No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA
Table of contents

Foreword: Vocalize to Localize: How to Frame a Framework for two Frames?  vii

Introduction: Vocalize to Localize? A call for better crosstalk between auditory and visual communication systems researchers
Christian Abry, Anne Vilain and Jean-Luc Schwartz  1

Vocalize to Localize: A test on functionally referential alarm calls
Marta B. Manser and Lindsay B. Fletcher  13

Mirror neurons, gestures and language evolution
Leonardo Fogassi and Pier Francesco Ferrari  29

Lateralization of communicative signals in nonhuman primates and the hypothesis of the gestural origin of language
Jacques Vauclair  47

Manual deixis in apes and humans
David A. Leavens  67

Neandertal vocal tract: Which potential for vowel acoustics?
Louis-Jean Boë, Jean-Louis Heim, Christian Abry and Pierre Badin  87

Interweaving protosign and protospeech: Further developments beyond the mirror
Michael A. Arbib  107

The Frame/Content theory of evolution of speech: A comparison with a gestural-origins alternative
Peter F. MacNeilage and Barbara L. Davis  133

Intentional communication and the anterior cingulate cortex
Oana Benga  159

Gestural-vocal deixis and representational skills in early language development
Elena Pizzuto, Micaela Capobianco and Antonella Devescovi  179

Building a talking baby robot: A contribution to the study of speech acquisition and evolution
Jihène Serkhane, Jean-Luc Schwartz and Pierre Bessière  207

Aspects of descriptive, referential and information structure in phrasal semantics: A construction-based model
Peter F. Dominey  239

First in, last out? The evolution of aphasic lexical speech automatisms to agrammatism and the evolution of human communication
Chris Code  261

Name index  285
Subject index  293
Foreword
Vocalize to Localize: How to Frame a Framework for two Frames?
Vocalize to localize? Meerkats do it for specific predators… And babies begin to vocalize and point with their index finger toward located targets of interest at about nine months, well before using language-specific demonstratives. Such that-type units, correlated with what-interrogatives, are universal and, as relativizers and complementizers, have proved powerful in grammar construction. Even among referential calls in nonhuman primates, some convey more than mere localization: semantics and even syntax. Instead of just telling a new monomodal story about language origin, advocates of representational gestures (semantically transparent), with a problematic route toward speech, meet here advocates of speech, with a problematic route toward the lexicon. The present meeting resulted in contributions from 23 specialists in the behaviour and brain of humans, including comparative studies in child development and nonhuman primates, aphasiology and robotics. The near future will tell us whether this continuing crosstalk – between researchers in auditory and visual communication systems – will lead to a more integrative framework for understanding the emergence of 7-month babbling and 9-month pointing: two types of neural control whose coordination (Abry & Ducey, Evolang7, 2008) could pave the "royal road to language" (Butterworth, in Kita, Pointing, 2003), up to the one-year first words, with their semantics and phonology, and their syntax emerging "on the heels of pointing" (Tomasello et al., Child Dev., 2007); and beyond, to how pointing, once dissociated from the joint word, would lead to two-word speech (Goldin-Meadow & Butcher, in Kita, 2003), etc. These are the main inescapable targets presently known along a puzzling route still to be traced. Instead of a full-blown theory we chose, in the famous legacy of the late Francis Crick for consciousness, a framework approach, including testable proposals. Framed at the beginning of this century, some years ahead of our first meeting (in Grenoble, January 2003, before the second one, VOCOID, May 2007), this Vocalize-to-Localize framework was not explicitly developed in the resulting publication of the two Interaction Studies issues (2004–5), now updated for this Benjamins Current Topics volume. Since then, three publications, all issued in 2008, help in weighing this framework. The framework flow diagram itself (see Figure 1, A Framework for two Frames) was finally made available for the first time in English by Abry and Ducey in the above-mentioned Evolang7 conference book (Barcelona, March 2008, pp. 3–9), coinciding with the publication of Emergence of Linguistic Abilities (Cambridge Scholars Publishing, 2008, pp. 80–99), from a conference held in Lyon (Dec. 2005). Meanwhile, we entrusted the test of our core proposal to the proceedings of a satellite workshop of the XVth International Congress of Phonetic Sciences, held in Barcelona (August 2003) in honour of Peter MacNeilage, which finally appeared in The Syllable in Speech Production (Lawrence Erlbaum, 2008, pp. 409–427). Evolang7 gives an overview of recent research in the field, and the forthcoming VOCOID publication will contain work in progress within the Vocalize-to-Localize framework.

Figure 1 shows that at about one year of age, the Speech Frame becomes embedded into the Sign Frame: one or two Syllables in a Foot template for the first "Prosodic Words". For the Speech Frame, after Canonical Babbling, say "Syllable" rhythm emergence, two additional controls have to be mastered: Closance control for the "Consonant", and Coarticulation (Coproduction) for the "Vowel" postural control within the "Consonant". For the Sign Frame, three maturating brain streams become recruited: occipito-parietal event detection (When), which enters the now classical dorsal (Where) and ventral (What) pathways.
Figure 1. A Framework for two Frames
Their outcomes are Objecthood permanence and Agentivity (Who-system), while the ventro-parietal How-system affords Shape Affordance, before the objecthood Color What-system. Classically, the Sharing Attention-Intention cooperative Mechanisms (SAM-SIM) develop later than Eye Direction Detection (EDD). Among the corresponding "answers" (Then/There/That) to the Wh-systems, the most relevant stream for linguistic pointing (imperative, declarative, cooperative) is our fronto-parietal That-Path (Broca-SMG), together with our Stabil-Loop, the verbal working memory under articulatory gesture phasing control, stabilizing linguistic forms in learning (see Introduction).

Given 2-syllable first words, and once we had measured a mean of 3 Hz for babbling cycles (in agreement with the literature, old and new), the prediction of this framework was a 2:1 Babbling/Pointing ratio. The empirical outcome was that, knowing the distribution of the babbling cycles of six babies, video-recorded each fortnight from 6 to 18 months, we could successfully predict the range of durations of their pointing strokes: between 2 and 3 syllables in a metrical Foot-Point (Emergence of Linguistic Abilities, 2008, pp. 80–99). That is a universal trend for the prosodic Word-Point. At this stage we can state that, even if neuro-biomechanical models are still lacking to ultimately give the bandpasses (modes) of the child's babbling jaw and pointing arm, one can already consider that the production of a word every 2/3 of a second is better explained by this cognitively embedded, embodied, embrained arm deixis than by pure mental or brain lexical-encoding chronometry. What we dubbed: "The phonological foot dwells on the arm stroke"!

This encouraging achievement was recently acknowledged in the conclusions drawn by MacNeilage for his book The Origin of Speech (Oxford University Press, 2008), where he comments on his points of convergence with Steven Wise's chapter on "the primate way of reaching" in The Evolution of Nervous Systems (Elsevier, 2007, vol. 4, pp. 157–166). The development of the pyramidal tracts in primates afforded, from the cortical homologue of Broca's region ("mirror-neuronal" F5) downward to cervical spinal and bulbar centers, direct control of head and arm, together with laryngeal and supralaryngeal articulators (jaw, lips, velum and tongue). According to Wise, "If […] the human homologue of PFv [prefrontal ventral cortex] maps meanings to communicative gestures, including vocal ones, then perhaps the homologue of PMv [premotor ventral F5] underlies computations that achieve the motor goals of such gestures" (Cortex, 2006, pp. 523–524), i.e. laryngeal-mandibular babbling, eye-head orientation and arm pointing during social signaling (Wise, pers. comm. on this important PF-PM link). MacNeilage (2008, p. 328): "With Wise's work, we begin to see the promise of an evolutionary cognitive neuroscientific basis for the young child's triadic declarative pointing acts, in which she points at an object (typically with the right hand) while looking at the parent, and simultaneously vocalizing [say babbling]. In this context it is of interest to note that Abry et al. [The Syllable in Speech Production, 2008] make an evolutionary argument for a fundamental semantics/action coupling based on the fact that an infant's pointing movement takes about the same amount of time that it takes to produce two syllables (about two-thirds of a second)."

In line with such an evolutionary tracking of our cortical endowment, we would claim that the directed scratch (me there!) in grooming among wild chimps (Pika & Mitani, Current Biology, 2006, vol. 16, no. 6, R191–2) was the most recent important missing-link discovery (leaving aside controversies about ape pointing in the wild) for social signaling in non-human primates… ever since the wiping or scratch reflex of spinal frogs and cats! Inside our framework and beyond (see Abry & Ducey, Evolang7, 2008, pp. 6–7), we would like to continue MacNeilage's quotation: "It's also interesting to note, in the light of the depth of the evolutionary perspective provided by Wise, that Lashley (1951 [The problem of serial order in behavior]) suggested that our understanding of reaching and grasping might eventually make a contribution to the physiology of logic – a far cry indeed from Descartes' position." Thus challenging Chomsky's Cartesian minimalism (see Interfaces + Recursion = Language?, Mouton de Gruyter, 2007), we would boldly phrase our language-origin programme as Babbling + Pointing = Language, i.e. speech, and not only a monomodal "Language within our Point", with Pointing including eye-hand proto-demonstratives (giving That-recursion), together with eye-head gaze-shifting ("looking") as the root of the face proto-predicative attitude. Again, to be continued.

Christian Abry, Grenoble, Christmas 2008
Acknowledgements

Thanks to John Benjamins for offering this opportunity to give the Vocalize to Localize proposal a broader diffusion. To Marie-Agnès Cathiard for her help in establishing the indexes. And to Anke de Looper and Martine van Marsbergen for their editing assistance and their patience.
Introduction
Vocalize to Localize? A call for better crosstalk between auditory and visual communication systems researchers
From meerkats to humans

Christian Abry, Anne Vilain and Jean-Luc Schwartz
ICP-Grenoble
Vocalize to Localize: Just a speech scientist's bias?

In the last two days of January 2003, the Vocalize-to-Localize conference was held in Grenoble, organized by the Institute of Speech Communication (ICP-Stendhal-INPG, CNRS UMR 5009) and sponsored by a European Science Foundation project (COG-Speech: From Communication by Orofacial Gestures in Primates to Human Speech), launched in 2002 within OMLL (Origin of Man, Language, and Languages) and presented at the 5th Evolution of Language Conference (Leipzig 2004). The aim was clearly not to suggest seamless continuity between meerkat or vervet alarm calls and full-fledged languages, or even the babbling-pointing, deictic behaviour of human infants. Who would dare tell such a "meerkat-that" story? The story of the English deictic that, demonstrative → relativizer → complementizer, has been perfectly retold by our colleague Elizabeth Traugott. It is seen as part of a Chomskyan syntactic recursion story: the ball that hit the ball that hit the ball… and Chimpsky sees that Chompsky sees that… (see the heterogeneous joint proposal by Hauser, Chomsky & Fitch, 2002 vs. Pinker & Jackendoff, 2005). We know that this evolution is not at all restricted to English: jumping overseas, the same story happened to ken 'here', a locative adverb → demonstrative → 'relativizer', found in the Buang language during fieldwork by Gillian Sankoff's Montreal team in Papua New Guinea (pers. comm.). Elaborating on Greek deixis, deiktikos – in the tradition of Apollonios Dyskolos in Alexandria (2nd century) and Maximos Planudes in late medieval Byzantium (13th–14th century) – the famous pioneer 'localists' of the sixties (to mention just John Lyons), who remain linguists fond of space and scene analysis in semantic-conceptual representations, would prefer not to be hailed by meerkat calls! The same goes for philosophers of mind, like Zenon Pylyshyn, recently reviving a long scholarly tradition about indexicals – now viewed as 'objecthood' trackers, backed by Alan Leslie's and others' studies of developmental naive ontologies. And perhaps nobody would be able to tell anything more empirically fleshed out about an evolutionary link in the near future. So we did not expect that any participant, whether or not they worked in speech science – ethologists, developmentalists, neuropsychologists, neuroscientists, and roboticians attended the conference – could bring more to the table than anyone else at what we called, after the French idiom auberge espagnole, a potluck, snowy winter gathering. And they brought, and debated, well enough, as can be read from the lines printed here and in a following issue of this new hospitable journal Interaction Studies, successor to Evolution of Communication. We extend our thanks to Harold Gouzoules and the reviewers.
Vocalizing to localize predators by conspecific calls

Emitting sounds that can be used to localize things in the world – e.g., echolocation with ultrasound, as in bats – is not so widespread. The more common situation is that the emitter is localized by others, as when the distress calls of youngsters are detected by their mothers… or by predators. Emitting sounds that can be used by others to localize things in the world, apart from the emitter itself, is achieved by referential speech in all its informational complexity, which sets humans apart from the rest of the animal kingdom. In fact, among human intentional linguistic signals, including those most neglected by linguists (like interjections; Ameka, 1992), some utterances like "Watch it!" or "Timber!" are not so precise – regarding the object for the former, or the directionality for the latter. But when embedded in a situation, they may nonetheless increase one's chances of staying alive. The same is true for vervet monkeys, suricates (meerkats), and dwarf mongooses. Ground squirrels, prairie dogs and (farm) chickens also differentiate predators by calls. For chickens this is simply a convergence, as interesting as any product of morpho-functional pressure, e.g., streamlining in otherwise unrelated sharks and dolphins. But contrary to what is often said and repeated (from Cheney & Wrangham, 1987 to Tomasello, 2003a), it seems not so difficult to track this faculty throughout the primates, via Barbary macaques and Diana monkeys, 'up to' chimpanzees in the wild (Crockford & Boesch, 2003).

So what can comparative biology tell us about this faculty in order to better ground language evolution? Deflationists – philosophers, psychologists – have argued against the referential, not to speak of the deictic, character of alarm calls (e.g., Tomasello, 2003a). Some of these calls simply convey – as evidenced by replicated experimental field results – information about the different spatial ranges of avian, mammalian and reptilian predators, and a certain proximity of the danger (emergency level). This audible signal is comparable to a wartime air-raid warning – not to an all-purpose siren recruiting the firemen of my village for any type of accident – and it makes meerkats flee or scan the sky, with no more or less need to check their conspecifics' gazes than I would have to detect the planes myself before hurrying toward a shelter. It is thus specific enough, without requiring that meerkat or monkey minds demonstrate intentionality for such localizing information about a presence. Our meeting stance was that, without inflating alarm calls as directly relevant to human deixis and predicate faculties, one could tell something worthwhile about primate ways of pointing and the human grammaticalization of deictic tools for getting words. For those who do not reject such a notion from the first call, we hope that readers of these two joint issues will find reasons to keep all these questions open, i.e. to track human precursors rather than insulating boundaries. Moreover, stimulated by a revival – the multi-voiced proposal that signers could be given first place in language evolution – readers will probably enjoy a new version of this now popular debate, once they have realized what could be the core insights offered by the crosstalk between the still significant gestural medium for referencing and the now dominant verbal medium for predicating.
First gestural references with vocal/gestural predicates?

Prior to our meeting, no more than the beginning of such a story was told, at the 3rd Evolution of Language conference (Paris, 2000), by Bickerton (2002, pp. 218–221), a professional linguist and pioneer in language evolution, who has long posited a proto-language stage before language, and later an ecology-based theory of proto-language origin. (For our present argument we need not endorse his global proposal, nor his rejection of social intelligence versus foraging as a selective pressure on language emergence.)

Note that the first linguistic communications need not have been mono-modal, nor need their units have been arbitrary in the Saussurean sense. Directional gesturing with the hand, accompanied by the imitation of the noise made by a mammoth, could easily have been interpreted as meaning 'Come this way, there's a […] mammoth'. […] Although the two symbolic units (the 'come' gesture and the 'mammoth' noise) might seem disjoint — two separate, single-symbol utterances, like the one-word utterances of 15–18-month-old humans — they could easily have been reinterpreted (just as infant utterances at the one-word stage can often be reinterpreted predicatively, Bloom, 1973) as '[…] mammoth thataway', in other words, as the first true predication. And as pointed out above [p. 210], predication — focusing on something, then making a comment about that something — is one of the most basic characteristics of human language, one that clearly distinguishes it from all other animal systems. […] with a minimum of units [hominids] could convey messages regarding the location and nature of available food supplies that would have a direct and immediate impact on the survival of those who heard and correctly interpreted those messages […]. (Bickerton, 2002, pp. 219–220)
Notice that vocal or gestural predication makes no difference for Bickerton and others, whereas vocal localizing is not envisaged at all, even in the case of the meerkat.
Integrating prelinguistic calls into proto-language: Vocal alarm predicates + gestural references?

Bickerton's foraging theory ends with this transitional proposal:

Note that under [the threats of predators] (and perhaps only under these circumstances) units from pre-linguistic communication systems might have been absorbed into the proto-linguistic system; Cheney and Seyfarth (1990: 144–9) have shown that, whether or not such calls come under limbic control, their utterance is subject to voluntary modification. Assume that some ancestral species had warning calls that related to major predators, as vervet alarm calls do today. Such calls (perhaps with a different inflection), if coupled with pointing at a python-track, pawprint, bloodstain, or other indication of a possible nearby predator, could very likely have been understood as a warning that did not require immediate reaction, but rather a heightened awareness and preparedness for action. (Bickerton, 2002, pp. 220–221)
The missing link: A 'referential and conceptual feces'?

Suricates (Manser et al., 2002), apart from general alerting calls, display one type of call specific to terrestrial predators, primarily jackals, and another specific to avian predators, including the martial and tawny eagles and the pale chanting goshawk. In addition:

They give a third alarm call type to snakes, such as the Cape cobra, the puff adder and the mole snake. Snake alarm calls are also given to fecal, urine or hair samples of predators and/or foreign suricates [our italics]. Because snake alarm calls to all of these stimuli cause other animals [conspecifics] to approach the caller, give alarm calls themselves, and either mob the snake or investigate the deposit, they are collectively termed recruitment alarm calls. (Manser et al., 2002, pp. 55–56)
This mobbing behaviour is not unique to suricates, occurring convergently in crows as well as in primates. What seems interesting to mention is less the generalizing scope of the call across very different stimuli than what is empirically new. Unexpected and unpredictable was the link between a snake and various external threats, including those posed by foreign conspecifics and those from terrestrial predators, since suricates have a separate call for the latter. But even more essential is the evidence of the 'conceptual' similarity, among these cooperatively breeding mongooses (with sentinels and nannies), between a present snake predator and an absent terrestrial predator or foreign conspecific. This offers a first progressive, incremental answer to the supposed gap between humans and other animals, the latter typically deemed unable to 'think' beyond the hic et nunc. It is empirically remarkable if one wants to track the continuity of these calls toward any "other indication of a possible nearby predator", long before pointing, and not just as imagined above by Bickerton (2002: 210). Traces, here olfactory ones, are the most obvious possible links between hic et nunc situations, with their specificity, and animal-human neural memories, through exemplarity (see the taste of the much-quoted Proust madeleine).
A Sign Language case of reference lumped with the predicate (and beyond)

Engberg-Pedersen (2003), a specialist in Danish Sign Language, begins her contribution to Pointing (Kita, 2003) with this seminal anecdote: once a deaf mother signed CHARLOTTE WHERE ('where is Charlotte?'), Charlotte being her daughter standing right next to her. Charlotte responded by pointing energetically to herself. She did not point to the ground where she was standing as a way of answering the request for a location. Neither did she point first to herself and then to the location to indicate who was where.

A point to an entity X in a location Y as a response to the question Where is X? can be seen as a condensed way of saying X is at Y; the point has the same communicative function as a simple proposition used to refer to X and predicate of X its existence at Y. But while the pointing gesture simply links two entities, X and Y, Y is predicated of X in the linguistic expression X is at Y, and in this sense Y is subordinate to X […]. When we point to entities in locations, we do exactly that: we point to the entity not the location. We focus on entities, but use space to keep track of them. The indexical aspect of a pointing gesture is its use of a location in space, but in a pointing gesture the two functions, reference and predication, are expressed by one form. (p. 269)
Philosophers are still discussing whether deixis includes location or not. Linguists know that a locative adverb can be grammaticalized as a demonstrative (see Buang ken above). What is less known is that a locative adverb can become a predicate of existence, like Gothic hiri, hirjith, hirjats ('come!', 2nd person singular, plural, dual; from hêr, compare German hier 'here'; and see Fillmore, 1966, a pioneer of deixis in verbs). Linguists have also become clearly aware that not all languages display overt subject-verb predication (see, among others, the illuminating example of 'omnipredicative' Classical Nahuatl, as expounded by Launey, 1994).
Commands = predicates with implicit references, and 'fossils'

Meerkat alarm calls are not statements and they do not predicate. "Note that even commands imply a subject-predicate distinction — '[you] do so-and-so' — and that in any case commands are of little use for doing what language uniquely does: transmitting (purportedly) factual information." Bickerton (2002, p. 210) reiterates here that his foraging theory posits a survival pressure and does not need 'socially intelligent' commands as precursors of language, even though they do exist among apes, like the often-quoted 'arm-raise', an invitation–initiation gesture for chimpanzees' rough-and-tumble play. Bickerton (2003, p. 85) does not hedge: "[…] if, say, initial utterances were things like Give that to me! […] — you wouldn't need language to express them"; "Body language is much more reliable for most animal purposes" (p. 83, and n. 4). If, for our purpose, we just take advantage of his acknowledgement that predication is implicit in some existing primate commands, one surely cannot accuse Bickerton of being an inflationist on this point (unlike Bloom, whom he quoted above for the child one-word stage). He simply joined language philosophers and logicians who have long been interested in how to cope, in addition to statements, with commands, questions, etc. – say, speech acts. Jackendoff (2002, pp. 255–256), elaborating on the proto-language/language stages, strangely puts "questions, commands and exclamations" into Bickerton's second stage, whereas he acknowledges that they can be universally conveyed by intonation (word order, inflections and function words depend upon further language-specific grammaticalizations). Some pages earlier (p. 240), he gathered his "'fossils' of the one-word stage of language evolution", which contain exclamations and even the reputedly human 'proto-command' no. "Their semantic and pragmatic diversity suggests that they are island remnants of a larger system, superseded by true grammar." From the beginning (pp. 131–132) he exemplified such English-specific "'defective' lexical items" as oops! and goodbye!, the latter being a grammaticalization of God be with you. Hence it seems that all that can be said is that there are different 'streamlining' pressures (pace Bickerton, 2003, p. 89) – not 'fossils', but basic functions – that compact word constructions into the same integrated templates, as for interjections, exclamations, vocatives, calls… Hail Virgin Mary or hail-fellow-well-met! French oui ('yes') is such a case, coming from the Latin demonstratives hoc ille (lit. 'this he (does)'; for hoc alone, cf. South Gallo-Romance, so-called Occitan or Langue d'Oc). Another interesting French case – which illustrates the transfer from a command-predicate to a deictic – is voici, voilà ('behold' or 'here is, there is'), two presentative demonstratives issued from the imperative or bare stem voi(s) ('see! look!') plus a locative adverb (locational adposition), proximal (i)ci ('here') or distal là ('there').
Coupling with pointing, and without leaving the 'royal road to language' unpaved

"[…] although human infants vocalize and babble from soon after birth, it is gestures that for many children seem to be the first carriers of their communicative intentions. And it is gestures that seem to pave the way to early language […]". Tomasello (2003b, p. 35) converges here on an important point with the late George Butterworth (2003), who ultimately wrote that, among gestures, pointing opens 'the royal road' towards language. The stories they told about the evolution of pointing do not altogether resemble the one proposed even by a functionalist linguist like Givón (1998, p. 85): going from object grabbing to emitting a "specific lexicalised vocal cue alone", say that! – via reaching, pointing, and adding a general, then a specific, vocal cue. But once ape and human so-called 'imperative' pointing is superseded by the little man's 'declarative' stance, both authors seem to offer no constraining device for deriving language from non-language: hence the way is not really paved to shape words and other larger constructions. Butterworth gives general correlational evidence on pointing and language skills. Tomasello relies on pattern-finding, for perceptive as well as for motoric metrical templates. But why are first words one to two syllables long? And why syllables? If not a copy of motherese, this is at least an evolutionary issue. Our answer (Abry et al., 2008) starts from the fact that, after the 7-month babbling – say, MacNeilage's mandibular frame as the origin of proto-syllables – another frame is the 9-month ('imperative') pointing. Some data (including ours, but still too sparse) indicate that, while [bababa…] or [dadada…] babbling runs at a 3 Hz rhythmic mode, the control of a discrete pointing arm-stroke runs at about 1.5 Hz. The pointing stroke can thus chunk the babbling flow into one long or two shorter syllables. Identified with the control of the metrical unit known as the foot – manifest after one year in the first prosodic words – this 'point-foot' could ensure a crosstalk between a semiotic, symbolic unit, the word, and phonological rhythmic units, the syllables, thus making control resources available for a template that would otherwise be miraculous. This is not to say that working memory has nothing to constrain into such a template/frame, but just that the precise span for one-year-old children is still unknown.
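As a back-of-the-envelope restatement (nothing beyond the two rates just quoted goes into it), the chunking arithmetic can be made explicit:

$$
T_{\mathrm{stroke}} = \frac{1}{1.5\ \mathrm{Hz}} \approx 0.67\ \mathrm{s}, \qquad
T_{\mathrm{syllable}} = \frac{1}{3\ \mathrm{Hz}} \approx 0.33\ \mathrm{s}, \qquad
\frac{T_{\mathrm{stroke}}}{T_{\mathrm{syllable}}} = \frac{3}{1.5} = 2.
$$

One pointing stroke thus spans about two babbling cycles: the 2:1 Babbling/Pointing ratio of the Foreword, and the roughly two-thirds-of-a-second window into which one long or two short syllables of a 'point-foot' can fit.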
Neural 'that-path' and 'stabil-loop': Two pieces in the puzzle of language evolution

Extrapolating backwards from increasingly available adult neural-behavioural data, a possible developmental scenario could be the following.

1. Pointing, whether performed with the right index finger or the left, shows a left-hemisphere dominance, as recently evidenced by Astafiev et al. (2003), recruiting the fronto-parietal circuits for the eye and the hand; Cantalupo and Hopkins (2001) add that the left anatomical 'Broca' dominance in chimps corresponds to a right-hand bias when they vocalize and gesture.

2. Linguistically, this could be what we now call the left-dominant 'that-path', fleshing out a bit more the classical and too general dorsal 'where way' that Hurford (2003) proposed to recruit for deixis; and this is what a neuro-imaging experiment on deixis, via intonation (focus) vs. syntax (extraction), gave as a first result (Lœvenbruck et al., 2002, 2005), recruiting a left fronto-parietal circuit for the 'linguistic laryngeal-oro-facial mouth', BA47-BA40.

3. Since we finally showed that the stabilizing neural circuit for multi-stable verbal percepts (the asymmetric life → fly effect) was actually the phonological loop (Abry et al., 2003; Sato et al., 2004), we now view this input memory device as sensitive to motor control, favouring the long-term winning of more in-phase articulatory gestures (Sato et al., 2002). Thus, e.g., ma and mama win over am and amam, because the lowering tongue gesture for the open phase of the cycle, the vowel, can be anticipated – is in fact set – within the closing phase, the consonant gesture. And this as early as the first words, a remarkable coincidence with such a major step in speech control as coarticulation, or coproduction (Sussman et al., 1999). Therefore the two gestures can be set in phase in ma, whereas in am the closing gesture obviously cannot be completed simultaneously, in synchrony, with the open vowel.

4. Taken together, our neuro-imaging results show that the left fronto-parietal attentional 'that-path' for speech deixis is part of the phonological-loop = verbal-transformation-effect stabilizing circuit: which we dubbed the 'stabil-loop'.

5. While purely intonational deixis (focus: "MADELEINE did it") activates both frontal and parietal loci, syntactic deixis (with cleft function words entrenched from a former presentative deixis: "It's Madeleine who did it") deactivates the supramarginal area. Since Broca's area meanwhile remains active, we interpreted this parietal deactivation as a grammaticalization step, in which the elaborated sensory parietal information corresponding to the expected feedback of one's own action goal (here feeling your voice showing, say pointing) is no longer necessary once the stored deixis packaging is being used.

6. Finally, when phonology becomes fluent you just need the perisylvian cortex, without Broca's area (as evidenced by different groups, e.g. Wise, Ackermann, and ourselves).

Coming back to the 'that-path', which needs at least part of Broca's area, we will finally note that it fits well with what we have learned from the empirical construction approach to syntax acquisition up to age 4, notably through the remarkable collaboration of Tomasello with Lieven and Diessel. The use of presentational constructions – Babybot! → That's Babybot! → That's Babybot (that is) naughty! → That's Babybot (that) sai(d) (that) you (are) naughty! – is directly in line with the use of deixis in syntax, in fact a perfect that-that-that story. But surely more behavioural and neural knowledge is needed on babies that point.
The ultimate lesson from meerkats: No more pure armchair stories!

Surely more behavioural and neural knowledge is also needed on animal vocalizations that localize. But aware as we now are of the bulk of tangible field work that has already been done, one should remember that armchair theorists would never have: (1) predicted that different alarm calls for different predators existed in vervets or meerkats; or (2) imagined the precise acoustic shape of these calls. So why should linguists, including speech scientists, or anyone else have the slightest chance of correctly positing the first word or 'language fossil', in between animal and child data which, together with (paleo-)genetics, will probably be the main growing empirical fields for our phylogenetic quest? The division of the conference contributions into two issues is of course just a matter of convenience. The first issue contains meerkats, monkeys, apes and humans, including Neandertal, ending with a dismissal of any acoustic charge against this related vocal tract. The second issue deals more specifically with babies and babybot systems, ending with a 'first-in/last-out' nascent-remnant view of aphasia. It opens by offering a thread for crosstalk, with a proposal for 'interweaving proto-sign and proto-speech' (not to speak of our present introductory motto: put a foot into the arm!). And it is particularly welcome as a link with the first issue, articulating a hot debate… which rebounds as soon as the second issue opens. To be continued!
References

Abry, C., Sato, M., Schwartz, J.-L., Lœvenbruck, H. & M.-A. Cathiard (2003). Attention-based maintenance of speech forms in memory: The case of verbal transformations. (Commentary on Ruchkin, D. S., Grafman, J., Cameron, K. & R. S. Berndt, Working memory retention systems: A state of activated long-term memory, pp. 709–777.) Behavioral and Brain Sciences, 26, 728–729.
Abry, C., Ducey, V., Vilain, A. & C. Lalevée (2008). When the babble syllable feeds the foot in a point. In: B. Davis & K. Zajdó (Eds.), The syllable in speech production (pp. 409–427). Mahwah, NJ: Erlbaum.
Ameka, F. (Ed.) (1992). Interjections. Special issue of Journal of Pragmatics, 18(2/3).
Astafiev, S. V., Shulman, G. L., Stanley, C. M., Snyder, A. Z., Van Essen, D. C. & M. Corbetta (2003). Functional organization of human intraparietal and frontal cortex for attending, looking, and pointing. The Journal of Neuroscience, 23(11), 4689–4699.
Bickerton, D. (2002). Foraging versus social intelligence in the evolution of protolanguage. In: A. Wray (Ed.), The transition to language (pp. 207–225). Oxford: Oxford University Press.
Bickerton, D. (2003). Symbol and structure: A comprehensive framework for language evolution. In: M. H. Christiansen & S. Kirby (Eds.), Language evolution (pp. 77–93). Oxford: Oxford University Press.
Bloom, L. (1973). One word at a time: The use of single-word utterances before syntax. The Hague: Mouton.
Butterworth, G. (2003). Pointing is the royal road to language for babies. In: S. Kita (Ed.), Pointing: Where language, culture, and cognition meet (pp. 9–33). Mahwah, NJ: Erlbaum.
Cantalupo, C. & W. D. Hopkins (2001). Asymmetric Broca's area in great apes. Nature, 414, 505.
Cheney, D. L. & R. Seyfarth (1990). How monkeys see the world: Inside the mind of another species. Chicago: University of Chicago Press.
Cheney, D. L. & R. W. Wrangham (1987). Predation. In: B. B. Smuts, D. L. Cheney, R. M. Seyfarth, R. W. Wrangham & T. T. Struhsaker (Eds.), Primate societies (pp. 227–239). Chicago: University of Chicago Press.
Crockford, C. & C. Boesch (2003). Context-specific calls in wild chimpanzees, Pan troglodytes verus: Analysis of barks. Animal Behaviour, 66, 115–125.
Engberg-Pedersen, E. (2003). From pointing to reference and predication: Pointing signs, eyegaze, and head and body orientation in Danish Sign Language. In: S. Kita (Ed.), Pointing: Where language, culture, and cognition meet (pp. 269–292). Mahwah, NJ: Erlbaum.
Fillmore, C. J. (1966). Deictic categories in the semantics of "come". Foundations of Language, 2, 219–227.
Givón, T. (1998). On the co-evolution of language, mind and brain. Evolution of Communication, 2(1), 45–116.
Hauser, M., Chomsky, N. & T. Fitch (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298, 1569–1579.
Hurford, J. R. (2003). The neural basis of predicate-argument structure. Behavioral and Brain Sciences, 26(3), 261–316.
Jackendoff, R. (2002). Foundations of language: Brain, meaning, grammar, evolution. Oxford: Oxford University Press.
Launey, M. (1994). Une grammaire omniprédicative: Essai sur la morphosyntaxe du nahuatl classique. Paris: CNRS Editions.
Lœvenbruck, H., Baciu, M., Segebarth, C. & C. Abry (2002). Prosodic deixis (focus) and syntactic deixis (extraction) in LIFG and LSMG. 8th International Conference on Functional Mapping of the Human Brain, Sendai, Japan.
Lœvenbruck, H., Baciu, M., Segebarth, C. & C. Abry (2005). The left inferior frontal gyrus under focus: An fMRI study of the production of deixis via syntactic extraction and prosodic focus. Journal of Neurolinguistics, 18, 237–258.
Manser, M. B., Seyfarth, R. M. & D. Cheney (2002). Suricate alarm calls signal predator class and urgency. Trends in Cognitive Sciences, 6(2), 55–57.
Pinker, S. & R. Jackendoff (2005). The faculty of language: What's special about it? Cognition, 95, 201–236.
Sato, M., Schwartz, J.-L., Cathiard, M.-A., Abry, C. & H. Lœvenbruck (2002). Intrasyllabic articulatory control constraints in verbal working memory. Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002), Sept. 16–20, Denver, USA, 669–672.
Sato, M., Baciu, M., Lœvenbruck, H., Schwartz, J.-L., Cathiard, M.-A., Segebarth, C. & C. Abry (2004). Multistable representation of speech forms: A functional MRI study of verbal transformations. NeuroImage, 23, 1143–1151.
Sussman, H., Duder, C., Dalston, E. & A. Cacciatore (1999). An acoustic analysis of the development of CV coarticulation: A case study. Journal of Speech, Language, and Hearing Research, 42, 1080–1096.
Tomasello, M. (2003a). On the different origins of symbols and grammar. In: M. H. Christiansen & S. Kirby (Eds.), Language evolution (pp. 96–110). Oxford: Oxford University Press.
Tomasello, M. (2003b). Constructing a language: A usage-based theory of language acquisition. Cambridge, MA: Harvard University Press.
About the authors

Christian Abry, Docteur d'État of Stendhal University, Grenoble (France), was appointed Professor of Experimental Phonetics in 1998, and was Head of the Department of Linguistics from 1998 to 2008. Over the last 20 years he has led three groups at the Institute of Speech Communication (ICP-INPG-Stendhal, CNRS UMR 5009) – Articulatory Modelling and Linguistic Anthropology of Speech – and founded Speech, Multimodality and Development. He has been principal investigator in several national and international research projects (1993–1995: Prime, with P. Badin, of ESPRIT-BR Speech Maps, coordinating 14 European labs in speech inversion and robotics). His main interests are speech production, control and robotics, perceptuo-motor interactions, speech working memory, bimodal speech, speech development, speech evolution, and narratives. Now at CRI (Center for Research on the Imaginary), Stendhal.

Jean-Luc Schwartz, a member of the French CNRS, led the Speech Perception Group at ICP (Institut de la Communication Parlée) from 1987 to 2002, and has been leading the laboratory since 2003. His main research areas are auditory modelling, speech perception, bimodal integration, perceptuo-motor interactions, speech robotics and the emergence of language. He has been involved in various national and European projects, and has authored or co-authored more than 30 publications in international journals such as IEEE SAP, JASA, Journal of Phonetics, Computer Speech and Language, Artificial Intelligence Review, Speech Communication, Behavioral and Brain Sciences, Hearing Research, Cognition, and NeuroImage, about 20 book chapters, and 100 presentations at national and international workshops. He is co-editor of a book on speech communication and of special issues of the Speech Communication and Primatologie journals, and was a co-organiser of the last Audio-Visual Speech Processing conference in 2003. Now at GIPSA-LAB, INPG-Stendhal.

Anne Vilain completed a PhD in Phonetics in 2000, and has been Maître de Conférences in Experimental Phonetics at the University of Grenoble since 2001. Her research is carried out within the Linguistic Anthropology of Speech group at the Institut de la Communication Parlée (ICP), and her main research interests are speech motor control, the emergence of language, and the ontogeny of speech production. Now at GIPSA-LAB, INPG-Stendhal.
Vocalize to localize: A test on functionally referential alarm calls

Marta B. Manser and Lindsay B. Fletcher
Verhaltensbiologie, Zoologisches Institut, Universität Zürich, Switzerland / Dept. of Psychology, University of Pennsylvania, USA
In this study of the functionally referential alarm calls of meerkats (Suricata suricatta), we tested the hypothesis that the ability to refer to a specific location was an important factor in the evolution of discrete vocalizations. We investigated what information receivers gained about the location of the predator from alarm calls with high stimulus specificity compared to alarm calls with low stimulus specificity. Furthermore, we studied whether visual cues about the location of the predator may be available from the posture of the caller. We described the general behaviour of the caller, the caller's posture, and in particular its gaze direction. We then observed receivers responding to the different call types, to determine whether the acoustic structure of the calls was enough for them to respond in the appropriate way, or whether they used additional visual cues from the caller. We tested this with specific manipulation experiments, using three playback set-ups: (1) no caller visible; (2) a model guard with a specific gaze direction; and (3) a live sentinel. Natural observations and experiments confirmed that in high urgency situations meerkats have enough information from the acoustic structure of the call to respond appropriately. When hearing low urgency calls that are less stimulus-specific, meerkats used visual cues as an additional source of information in a few cases. This may indicate that functionally referential calls evolved to denote the location of the predator, rather than the predator type or its speed of approach. However, when this result is compared with other functionally referential calls, such as food-associated calls and recruitment calls, the localization hypothesis does not appear to apply to functionally referential calls in general.

Keywords: evolution, vocalization, Suricata suricatta (meerkat), localization
Introduction

Human speech includes information about a person's emotional state as well as about external objects. One recent discussion on the evolution of language has emphasized the importance of the localization function in the development of vocalization (see Abry et al., Introduction to this issue; Abry & Schwartz, 2004). This suggests that discrete acoustic signals evolved when individuals were selected to denote external objects rather than just express their emotional state. One approach to testing this hypothesis is to investigate the calls of animal species that denote external objects, so-called functionally referential vocalisations (Evans et al., 1993). If functionally referential calls contain all the information about the location of an object or event, this would support the idea that the ability to refer to a specific location in the environment was a main selective factor in the development of discrete call types. However, it may be that receivers gain additional cues about the location of external objects from the gaze of the caller. It has been shown for several primate species that animals are able to follow the gaze of conspecifics (Tomasello et al., 1998).

Seyfarth and Cheney (1980) showed that the alarm calls of vervet monkeys, Cercopithecus aethiops, refer to different predator types. Since then it has been commonly accepted that some animal vocalizations also denote external objects, and are not just expressions of the motivational state of the caller, as had generally been assumed previously (for a review see Hauser, 1996). In the meantime, functionally referential calls have been described in the alarm calls of several other primate species (Fischer, 1998; Zuberbühler, 2000; Crockford & Boesch, 2003), meerkats Suricata suricatta (Manser, 2001), prairie dogs Cynomys gunnisoni (Slobodchikoff, 1986), and chickens Gallus g. domesticus (Evans et al., 1993). Functionally referential calls have also been found in the context of food acquisition, as in rhesus macaques Macaca mulatta (Hauser & Marler, 1993) and capuchin monkeys Cebus capucinus (Di Bitetti, 2003; Gros-Louis, 2003).

Discrete call types have been described for different types of predators. However, whether these calls denote specifically the predator type, rather than the manner of approach (e.g., speed) or the location of the predator, is still unclear (Macedonia & Evans, 1993). Since eagles approach from the air at high speed, while terrestrial predators move much more slowly on the ground, the discrete call types may in fact denote the direction of the approaching predator, and only to a lesser extent the specific predator type. For example, in the dwarf mongoose Helogale parvula the alarm calls convey information on the predator type as well as on its distance and height above ground (Beynon & Rasa, 1989). In meerkats, while call types vary between aerial and terrestrial predators, the level of urgency (risk and distance of the predator) explains a large proportion of the variation in the acoustic structure (Manser, 2001). Furthermore, meerkats not only emit functionally referential calls, but also emit alert calls to a variety of stimuli in less urgent situations. Meerkats are therefore ideal subjects for investigating whether functionally referential alarm calls allow receivers to extract accurate information on the location of the approaching predator, and whether in less urgent situations visual cues, such as the gaze direction of the caller, are also important.

Meerkats, cooperatively breeding mongooses, live in groups of 5 to 40 members. They occur in the dry open parts of southern Africa, and show a high division of labour (Clutton-Brock et al., 1998). The group forages together from one shelter place to the next, emitting contact calls continuously. They have also evolved a variety of calls to coordinate their vigilance behaviour, including several different alarm calls depending on predator type and on the level of urgency of the required response (Manser, 2001). Receivers respond immediately to these calls, with different escape strategies for different predator types (Manser et al., 2001). However, meerkats also use three types of alarm calls that are not highly stimulus-specific. They emit a so-called 'panic call' in highly urgent situations, elicited by nearby terrestrial or aerial predators or by bird alarm calls; this results in receivers running immediately to the nearest bolthole. In low urgency situations, meerkats emit two different alarm calls that result in receivers scanning their environment. The 'alert call' is mainly given to aerial predators that are far away, or to non-dangerous animals nearby. The 'moving call' is given mainly to terrestrial animals, but also to non-dangerous birds flying close above the group, or to perched raptors moving their wings. This call type appears to be elicited by the movement of an approaching or departing animal, and is often followed by the predator-specific alarm call.

In this paper we investigate whether meerkat alarm calls contain enough information for receivers to locate an approaching predator, or whether they use additional cues from the alarmer's gaze direction. In particular, we compared the functionally referential calls with the less stimulus-specific call types elicited by predators. We used natural observations as well as several different set-ups of playback experiments to investigate these questions. We first investigated whether the general behaviour and the posture of callers differ when emitting alarm calls to different predator types, and in comparison with the less stimulus-specific calls. We then asked whether the response of receivers to alarm calls differed when they only heard the call by itself, without a visible caller, compared to when they saw the caller and were able to use visual cues from the gaze direction of the alarm-calling individual. Here we used two different approaches: (1) a cardboard model of a guard looking in a specific direction; and (2) a live guard on sentinel duty. We then discuss the results in relation to functionally referential calls in other contexts, such as food-associated calls.
Methods

Data were collected from June to December 2000, February to April 2002, and March to June 2003 on a wild population of meerkats on a ranch in the southern part of the Kalahari, 30 km west of Van Zylsrus, in South Africa (Clutton-Brock et al., 1998). We followed 6 groups with a total of 82 adult individuals along the dry riverbed of the Kuruman. All individuals were easy to identify by haircuts or natural marks, and were habituated to a degree that allowed us to follow them within 0.5 m.
Posture of alarm caller depending on context

We analysed the posture of the alarm caller, in particular the direction of its gaze depending on the context, to investigate whether meerkat alarm callers enhance the meaning of their vocalisations with specific gestures. We divided the context according to: (1) the type of predator approaching: (a) aerial or (b) terrestrial; and (2) the urgency of the situation: high or low (for details on this categorization see Manser, 2001). We first described the posture of the alarmer by distinguishing three positions of its head, and therefore of its gaze direction: (a) horizontal (±20 degrees); (b) partly up (45 ± 20 degrees); and (c) vertically up (±20 degrees) (see Figure 1). Because we had few observations of an individual looking vertically, we pooled gaze directions (b) and (c) as looking towards the sky. We also recorded the general behaviour of the caller, in particular whether it scanned the area or moved to another location, and the time delay before moving. We collected data on 20 predator encounter events per alarm call type (except for the medium urgency terrestrial call category, where we only had 12 observations) in 6 different groups. We avoided recording the response of the same individual to the same call type by using different subjects in each group (2 to 4 different individuals per group). However, some individuals were used repeatedly across the different alarm call types (1 to 4 times in total).

Figure 1. Posture of meerkats illustrating the different gaze directions: (a) down to horizontal position; (b) >30 degrees up to the sky; and (c) towards the sky.
Response to different alarm call types

We observed the responses of meerkats to the different alarm call types during natural predator encounters. We described the general behaviour of the meerkat, its immediate response, its total (immediate plus delayed) response, and the time frame within which the delayed response happened. In particular, for the immediate response we recorded whether receivers: (1) moved immediately for shelter; (2) scanned the area; (3) looked at the caller, and (4) followed its gaze direction; (5) scanned the sky; or (6) did not respond at all. For the total response we were interested in whether receivers: (7) looked at the caller, and (8) then followed the caller's gaze; (9) scanned the sky; and (10) gathered together. We collected data on 20 different alarm-calling events for each call type. As above, for each call type the response of a given individual was only recorded once, but the responses of some individuals were recorded for more than one call type (1 to 4 times).
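The ten response categories above amount to a small coding scheme. As a minimal sketch of how such an ethogram might be encoded for analysis (all names and the example record are hypothetical, not the authors' actual coding sheet):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ImmediateResponse(Enum):
    # Immediate-response categories (1)-(6) described above.
    MOVED_FOR_SHELTER = 1
    SCANNED_AREA = 2
    LOOKED_AT_CALLER = 3
    FOLLOWED_CALLER_GAZE = 4  # scored after first looking at the caller
    SCANNED_SKY = 5
    NO_RESPONSE = 6

@dataclass
class AlarmEvent:
    """One receiver's reaction to one alarm-calling event (hypothetical record)."""
    group: str
    subject_id: str
    call_type: str                   # e.g. "panic", "alert", "moving"
    immediate: ImmediateResponse
    looked_at_caller: bool           # total-response items (7)-(10)
    followed_gaze: bool
    scanned_sky: bool
    gathered_together: bool
    delay_s: Optional[float] = None  # delay of any delayed response, in seconds

# Example: a receiver hearing an alert call scans the sky, then looks at the caller.
event = AlarmEvent(
    group="A", subject_id="F04", call_type="alert",
    immediate=ImmediateResponse.SCANNED_SKY,
    looked_at_caller=True, followed_gaze=False,
    scanned_sky=True, gathered_together=False, delay_s=2.1,
)
```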
Response to playback experiments In addition to natural observations, we used playback experiments to find out how much information the meerkats gained from the acoustic structure of the calls, and how much they got from the posture of the alarm caller. We analyzed 18 playback experiments of each, medium urgency aerial, medium urgency terrestrial, panic, alert and moving calls. Some of these experiments had been conducted for other reasons (Manser et al., 2001), and were reanalyzed for this study. In addition, we performed two different types of playback experiments, in each of 6 groups, each playback experiment with a different individual. In the first set of playback experiments (A) calls were played from a hidden loudspeaker, without a caller visible. In the second set up (B) we played medium urgency aerial calls from a hidden loudspeaker beneath a model guard (cardboard with a meerkat picture of natural body size placed on a shrub of 1 to 1.5 m height). The model guard had its head directed in a specific position (90 degrees up into the sky). To ensure that the response was not unusual because of the model meerkat, we tested the playbacks with a live meerkat on guard. We also decided to play alert calls back, rather than medium urgency aerial calls, because our observational data showed that low urgency calls were more likely to draw the attention to the loudspeaker. In this third set up (C) we played the calls from a hidden loudspeaker beneath a live meerkat on sentinel duty. For this experiment we recorded
the behaviour of the live guard on video throughout the entire experiment, to check whether the caller gave additional visual signals that the receivers might have used. For the playback experiments we used high quality calls (six different examples per call type) that had previously been recorded with a directional Sennheiser MKH 816 microphone and a Sony PCM-M1 digital audio tape (DAT) walkman recorder (see also Manser, 2001). We edited the calls using Canary 1.2.1 on a Macintosh PowerBook G3-series. The calls were then played back with a DAT walkman connected to a Sony SR A60 walkman loudspeaker. Call amplitude was adjusted to what we had observed during naturally occurring alarm calls. Playbacks were only conducted on foraging groups and when the subject was at least 10 m away from the closest bolthole. Furthermore, we did not play a call if there had been an alarm call or another disturbance that had caused the group to run to a bolthole within the last 15 min. We recorded the response of the subject to the played alarm call with a Panasonic PV-DV910 digital video camera. We began filming the subject 30 sec before the call was played, and continued until the animal resumed its previous behaviour or otherwise relaxed. Typically we conducted only one playback experiment per week in the same group.
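The playback criteria in the preceding paragraph amount to a small eligibility checklist; a hypothetical sketch (names and units are ours):

```python
def playback_allowed(dist_to_bolthole_m: float,
                     min_since_last_disturbance: float,
                     days_since_last_playback_in_group: float) -> bool:
    """Return True if a playback meets the criteria stated in the text:
    subject foraging at least 10 m from the nearest bolthole, no alarm or
    other disturbance within the last 15 min, and (typically) no more than
    one playback per week in the same group."""
    return (dist_to_bolthole_m >= 10.0
            and min_since_last_disturbance >= 15.0
            and days_since_last_playback_in_group >= 7.0)
```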
Statistics

We analysed the responses of the meerkats with a logistic regression model (SPSS v. 11.0) to test the influence of location and level of urgency on the frequency of specific responses. For comparisons between the different playback set-ups we performed Fisher's exact tests, because of the small sample sizes. The delay to respond was analysed with an ANOVA after logarithmic transformation of the data to meet the requirement of normally distributed data (Sokal & Rohlf, 1995).
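The original analysis used SPSS v. 11.0; a rough modern equivalent of the three tests, using Python with statsmodels and SciPy, might look like the sketch below. The column names, the input file, and the 2 x 2 counts are hypothetical illustrations, not data from the study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical event table: 'moved' (0/1), 'location' ('air'/'ter'/'both'),
# 'urgency' ('high'/'low'), 'delay_s' (seconds, NaN if subject never moved).
df = pd.read_csv("alarm_events.csv")  # hypothetical file name

# Logistic regression of a binary response on location and level of urgency.
logit_fit = smf.logit("moved ~ C(location) + C(urgency)", data=df).fit()
print(logit_fit.summary())

# Fisher's exact test for a 2x2 comparison of two playback set-ups
# (illustrative counts only).
odds_ratio, p_fisher = stats.fisher_exact([[3, 15], [3, 3]])

# ANOVA on log-transformed delays to satisfy the normality assumption.
groups = [np.log(g["delay_s"].dropna()) for _, g in df.groupby("location")]
f_stat, p_anova = stats.f_oneway(*groups)
```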
Results

Posture of the alarm caller depending on the call type elicited by the predators

The gaze direction of the alarm caller depended only on the direction of the approaching predator, and not on the urgency of the situation (Table 1). Alarm callers emitting calls mainly to aerial predators (medium urgency aerial and alert calls) looked horizontally in 25% of events, and at an angle of more than 30 degrees towards the sky in the other 75%. They looked in a horizontal direction for all medium urgency terrestrial calls elicited by terrestrial predators. Predators eliciting a moving
call caused the alarmer to look horizontally in 85% of events; in the other 15% of events they looked at an angle of more than 30 degrees up into the sky. In the case of the high urgency panic call, given to terrestrial and aerial predators as well as to bird alarms, they looked horizontally in 90% of events, and towards the sky in two events (10%). Whether the alarm caller moved from its position depended on both the location of the approaching predator and the urgency of the situation (Table 1). In the high urgency contexts, including the medium urgency aerial and medium urgency terrestrial calls and the non-predator-specific panic call, the alarmer emitted the alarm call and ran immediately for shelter in 92% of events. In the less urgent situations, including the alert and moving calls, alarm callers emitted the alarm call, looked towards the approaching predator, and ran for shelter in only 33% of events. Alarm callers that changed position differed in their time delay to run for shelter depending on the location of the predator (F2,60 = 36.22, p ≤ 0.001), but not on the urgency of the situation (F1,60 = 0.17, p = 0.68) (Figure 2a). They moved immediately, on average within less than one second, when emitting the high urgency panic call. When giving medium urgency aerial or alert calls, alarm callers that ran for shelter did so within less than two seconds, independent of urgency. When emitting medium urgency terrestrial and moving calls, alarm callers often delayed their movement, on average for about three seconds.
Table 1. (a) Behaviour and posture of the alarm calling individual depending on the location of the predator and the level of urgency of the situation during natural predator encounter events. (b) Statistics for the two response categories of the alarm calling individual: posture (scan horizontally) and second response (moving).

(a) Responses observed

call type | location | urgency | total n | scan horizontally n (%) | moving n (%)
medium urgency aerial | air | high | 20 | 4 (20) | 17 (85)
alert | air | low | 20 | 6 (30) | 4 (20)
moving animal | ter | low | 20 | 17 (85) | 9 (45)
medium urgency terrestrial | ter | high | 12 | 12 (100) | 11 (93)
panic | both | high | 20 | 18 (90) | 20 (100)

(b) Statistics (logistic regression; Chi-square, p value)

effect | scan horizontally | moving
location (df = 2) | 41.01, < 0.001 | 13.78, < 0.001
urgency (df = 1) | 0.60, 0.44 | 36.2, < 0.001
overall (df = 3) | 41.01, < 0.001 | 39.36, < 0.001
Figure 2. Time delay in the response to move by: (a) the alarm caller after emitting the alarm call, and (b) the receivers to alarm calls emitted during natural predator encounters and playback experiments, depending on the alarm call type (bar graphs show means and SE values; y-axis: time in sec; x-axis: call types).
Table 2. (a) Behaviour of receivers depending on alarm call type and the context in which it was emitted during naturally occurring predator encounters, when predator and alarm calling individual were visible. (b) Statistics for the different response categories of the receivers.

(a) Responses observed, n (%)

1st response:
call type | location | urgency | total n | Moving | Scan area | Scan caller | Scan sky | Follow gaze | No response
medium urgency aerial | air | high | 20 | 13 (65) | 3 (15) | 1 (5) | 1 (5) | 0 | 2 (10)
alert | air | low | 20 | 3 (15) | 11 (55) | 3 (15) | 0 | 1 (5) | 3 (15)
moving animal | ter | low | 20 | 2 (10) | 9 (45) | 3 (15) | 0 | 2 (10) | 6 (30)
medium urgency terrestrial | ter | high | 20 | 14 (70) | 4 (20) | 1 (5) | 0 | 0 | 0
panic | both | high | 20 | 16 (80) | 1 (5) | 1 (5) | 0 | 0 | 2 (10)

Full response:
call type | Scan caller | Follow gaze | Scan sky | gather
medium urgency aerial | 6 (30) | 2 (10) | 2 (10) | 0
alert | 9 (45) | 3 (15) | 4 (20) | 0
moving animal | 7 (35) | 2 (10) | 1 (5) | 5 (25)
medium urgency terrestrial | 9 (45) | 4 (20) | 0 | 19 (95)
panic | 1 (5) | 0 | 0 | 0

(b) Statistics (logistic regression; Chi-square, p value)

response | location (df = 2) | urgency (df = 1) | overall (df = 3)
1st: Moving | 18.32, < 0.001 | 16.01, < 0.001 | 34.50, < 0.001
1st: Scan area | 10.26, 0.006 | 6.62, 0.04 | 33.66, < 0.001
1st: Scan caller | 0.49, 0.78 | 2.93, 0.09 | 3.15, 0.40
1st: Scan sky | 1.51, 0.47 | 0.67, 0.41 | 2.08, 0.43
1st: Follow gaze | 1.2, 0.55 | 4.64, 0.03 | 5.28, 0.17
1st: No response | 0.31, 0.86 | 5.32, 0.02 | 5.55, 0.13
Full: Scan caller | 8.43, 0.02 | 1.96, 0.16 | 9.61, 0.04
Full: Follow gaze | 3.22, 0.2 | 0.15, 0.7 | 3.30, 0.34
Full: Scan sky | 6.68, 0.04 | 3.1, 0.78 | 8.92, 0.04
Full: gather | 47.37, < 0.001 | 4.83, 0.03 | 53.2, < 0.001
Response to naturally occurring alarm calls

Receivers showed clear differences in their responses to alarm calls during natural predator encounters, depending on the call type and the urgency of the situation. In high urgency contexts, when a panic or medium urgency aerial call had been given, receivers ran immediately to the next bolthole in 73% of events, without scanning their surroundings (Table 2). When the medium urgency terrestrial call was given, subjects began to move towards the caller in 70% of events, but at a lower speed than for the other urgent calls (Figure 2b). They often gathered together with the other group members to move to a burrow system or to another part of their home range. When hearing the low urgency alert or moving call, the subjects interrupted their current activity and scanned their surroundings in 50% of events, and ran for shelter in only 15%; 23% of the subjects showed no response to these alarm calls elicited in low urgency contexts. The delay in the response to move depended on the location of the approaching predator (F2,56 = 11.18, p ≤ 0.001), and not on the urgency of the situation (F1,56 = 2.78, p = 0.1). Receivers rarely looked at the caller and followed its gaze. As a first response, they looked at the caller in 5% of emitted high urgency calls and in 15% of emitted low urgency alert and moving calls. Only in the events of the low urgency calls did receivers subsequently look in the same direction as the caller (alert calls 33%, moving calls 67%). In the total response, receivers looked at the caller about as frequently for the medium urgency aerial and medium urgency terrestrial calls (38%) as for the alert and moving calls (40%), and at this stage they were about as likely (48%) to follow the caller's gaze as when they had looked at the caller as a first response. Overall, this means that receivers followed the gaze of the alarm caller in only 3% of all events as a first response, and in 11% as the total response.
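The percentages reported here and in Table 2 are simple row-wise tabulations of the coded events; a hypothetical pandas sketch of that tabulation (toy data, not the study's records):

```python
import pandas as pd

# Hypothetical coded events, as in the AlarmEvent sketch above.
events = pd.DataFrame({
    "call_type": ["panic", "panic", "alert", "alert"],
    "immediate": ["move_to_shelter", "scan_area", "scan_area", "no_response"],
})

# Counts and row percentages per call type, as reported in Table 2.
counts = pd.crosstab(events["call_type"], events["immediate"])
percent = pd.crosstab(events["call_type"], events["immediate"],
                      normalize="index").mul(100).round(0)
print(counts, percent, sep="\n\n")
```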
Response to playback experiments of alarm calls

A. Relying only on the acoustic structure of the call (no alarm caller around)

When the different alarm call types were played from a hidden loudspeaker without a visible caller, the receivers immediately showed the appropriate response, as observed during natural predator encounters (Table 3). When we played the medium urgency aerial call, the subjects ran for shelter in 60% of events as a first response, scanned the area in 38% of events, and looked at the loudspeaker in 11%. When medium urgency terrestrial calls were given, subjects first scanned the area in 50% of events or looked at the loudspeaker in 28%; only one subject moved immediately. Most of them moved as a second response (Figure 2b) and then gathered together with the other group members in 89% of events. When the panic call was played, the subjects immediately ran to the next bolthole in 78% of events.
Table 3. Behaviour of receivers depending on alarm call type in playback experiments without a model or visible caller (following the gaze of a caller was not applicable, n.a., in this set-up).

(a) Responses observed, n (%)

1st response:
call type | location | urgency | total n | Scan area | Moving | Scan loudspeaker | Scan sky | Follow gaze | No response
medium urgency aerial | air | high | 18 | 7 (38) | 11 (60) | 2 (11) | 0 | n.a. | 0
alert | air | low | 18 | 10 (56) | 5 (28) | 0 | 4 (22) | n.a. | 2 (11)
moving animal | ter | low | 18 | 11 (61) | 0 | 5 (28) | 0 | n.a. | 3 (17)
medium urgency terrestrial | ter | high | 18 | 9 (50) | 1 (6) | 5 (28) | 0 | n.a. | 0
panic | both | high | 18 | 2 (11) | 14 (78) | 0 | 0 | n.a. | 1 (6)

Full response:
call type | Scan loudspeaker | Follow gaze | Scan sky | gather
medium urgency aerial | 3 (17) | n.a. | 8 (44) | 0
alert | 3 (17) | n.a. | 6 (33) | 0
moving animal | 11 (61) | n.a. | 3 (17) | 3 (17)
medium urgency terrestrial | 6 (33) | n.a. | 0 | 16 (89)
panic | 2 (11) | n.a. | 0 | 0

(b) Statistics (logistic regression; Chi-square, p value)

response | location (df = 2) | urgency (df = 1) | overall (df = 3)
1st: Scan area | 9.16, 0.01 | 6.38, 0.01 | 11.20, 0.01
1st: Moving | 11.23, 0.001 | 32.12, < 0.001 | 29.11, < 0.001
1st: Scan loudspeaker | 8.63, 0.01 | 0.02, 0.90 | 9.08, 0.03
1st: Scan sky | 6.28, 0.04 | 6.28, 0.01 | 11.51, 0.009
1st: Follow gaze | n.a. | n.a. | n.a.
1st: No response | 0.27, 0.88 | 5.03, 0.03 | 5.85, 0.12
Full: Scan loudspeaker | 9.11, 0.01 | 3.69, 0.055 | 13.22, 0.004
Full: Follow gaze | n.a. | n.a. | n.a.
Full: Scan sky | 16.21, < 0.001 | 1.46, 0.23 | 16.30, 0.001
Full: gather | 36.13, < 0.001 | 5.88, 0.015 | 50.22, 0.001
When we played the alert and moving calls, the subjects interrupted their foraging behaviour to scan their surroundings in 58% of events, and moved as a first response in only 14% of all events. These playback experiments confirmed the natural observations that the time delay to move depended on the location of the predator (F2,48 = 6.45, p = 0.003), but not on the urgency of the situation (F1,48 = 0.27, p = 0.6) (Figure 2b). Very seldom did the receivers look at the loudspeaker as a first response: when the aerial and panic calls were played, subjects looked at the loudspeaker in 6% of events, and more often, in 28% of events, in the case of terrestrial calls. They often looked at the loudspeaker with some delay, however, so that the total response included looking at the loudspeaker in 15% of all playback experiments for aerial and panic calls, and in 47% for terrestrial calls.

B. Relying on the acoustic call structure and visual cues of a model guard

When we played the medium urgency aerial calls beneath a model guard, the receivers did not change their behaviour much compared to when the same call type was played without a visible caller (Table 4, set-up a). In four of the six playback experiments the subjects interrupted their foraging behaviour and scanned their surroundings. In three events (50%) the subjects moved, and in one event the subject looked immediately at the loudspeaker with the model guard. As a second response, in three of the six playbacks (50%) the subjects looked at the loudspeaker and subsequently also at the model guard. Two of them (33%) then followed the gaze direction of the model guard.

C. Relying on the acoustic call structure and visual cues of a live guard

When we played the alert calls beneath a live sentinel, the receivers did not change their response in comparison to when only the call was played from a hidden loudspeaker (Table 4, set-up b). As a first response, none of the six subjects moved for shelter, four (67%) scanned the area, and two (33%) looked at the loudspeaker. None of these subjects followed the gaze of the live sentinel on guard. As the total response, three subjects (50%) looked at the loudspeaker and subsequently at the live guard, but none of them followed the guard's gaze. This might have been because the live sentinel did not look in a specific direction, but instead moved its head all the time.
Table 4. Comparison of the behaviour of receivers in playback experiments of: (a) medium urgency aerial calls without a visible caller versus with a model guard above the loudspeaker; (b) alert calls without a visible caller versus with a live guard above the loudspeaker. Statistics: Fisher's exact tests; responses given as n (%).

(a) Medium urgency aerial calls (air, high urgency)

PB set-up | total n | 1st: Scan area | Moving | Scan loudspeaker | Scan sky | Follow gaze | No response | Full: Scan loudspeaker | Follow gaze | Scan sky | gather
no alarm caller visible | 18 | 6 (33) | 11 (61) | 2 (11) | 0 | 0 | 0 | 3 (17) | 0 | 8 (44) | 0
model guard | 6 | 4 (67) | 3 (50) | 1 (17) | 2 (33) | 0 | 0 | 3 (50) | 2 (33) | 2 (33) | 0
p value | | 0.17 | 0.40 | 0.60 | 0.055 | n.a. | n.a. | 0.14 | 0.055 | 0.51 | n.a.

(b) Alert calls (air, low urgency)

PB set-up | total n | 1st: Scan area | Moving | Scan loudspeaker | Scan sky | Follow gaze | No response | Full: Scan loudspeaker | Follow gaze | Scan sky | gather
no alarm caller visible | 18 | 10 (55) | 5 (28) | 0 | 4 (22) | 0 | 2 (11) | 3 (17) | 0 | 6 (33) | 0
live sentinel | 6 | 4 (67) | 0 | 2 (33) | 0 | 0 | 0 | 3 (50) | 0 | 1 (17) | 0
p value | | 0.60 | 0.20 | 0.055 | 0.29 | n.a. | 0.55 | 0.14 | n.a. | 0.14 | n.a.

Discussion

The alarm calling individual and the receivers in meerkats behaved differently according to the alarm call type elicited by approaching predators. In this study, natural observations and playback experiments showed that meerkats respond
immediately to the different alarm call types in the appropriate way, depending on predator type and level of urgency. While the visual cues provided by the posture, and in particular the gaze, of the alarm calling individual are unimportant for the more urgent calls, they are sometimes used as a secondary cue in less urgent situations. It appears that meerkats use specific information in the acoustic structure of urgent alarm calls to determine the direction of the approaching predator (aerial or terrestrial). Only in less urgent situations do meerkats look for additional cues from the alarm caller's posture or direction of gaze. Meerkats may not rely more heavily on the visual cues of the alarm caller because, given their open habitat and the position of their eyes, they can scan a large part of the horizon; consequently, they are likely to spot an approaching predator in the sky or on the ground as quickly as they could gain information about its direction by looking at the alarm calling individual.

The behaviour of meerkats in the context of predator avoidance, and of alarm calls in particular, may indeed support the hypothesis that the localization function of vocalizations was important in the evolution of functionally referential calls. For the most urgent situations meerkats evolved only one alarm call, the so-called panic call, in response to which they immediately run to the closest shelter (Manser, 2001). In medium and high urgency contexts they use clearly discrete call types for aerial and terrestrial predators, and the receivers show the appropriate response. In low urgency situations they emit alert calls, which are less precisely referential to predator type or location; however, by using visual cues from the alarm calling individual, though only as a secondary source, this additional information can easily be gained if necessary. One could argue that the selection pressure in the high urgency situations was so strong that the evolution of functionally referential signals is not surprising: it is an obvious advantage for a meerkat to know whether to run for the closest shelter to be safe immediately, or to move with the group to another area of its home range. However, whether these functionally referential alarm calls denote predator types or the manner of the predator's approach, be it its speed or location, is still an open question.

Although the functionally referential alarm calls of meerkats provide some support for the hypothesis that discrete vocalizations evolved to denote specific locations of subjects in the external environment, the functionally referential food-associated calls do not. Food calls have been described for a few primate species (Hauser, 1998; Di Bitetti, 2003; Gros-Louis, 2003) and for chickens (Evans & Evans, 1999), and these vocalisations in general appear to denote preferred kinds of food. When food calls are played without a visible caller or food source, the subjects approach the loudspeaker as if they expected to find food. However, these vocalisations denote an external object next to the caller, and not one in a different direction
away from the caller, as in the case of the alarm calls. This is similar to the situation of the snake calls in vervet monkeys (Seyfarth et al., 1980), or of the recruitment calls in meerkats (Manser, 2001; Manser et al., 2001). For both call types, the content of the signal is not about location, but rather about specific characteristics of external objects. Therefore these calls, although described as functionally referential, do not support the localization hypothesis. This may suggest that other factors, such as specific categorizations in specific contexts, are more important. It also brings back the discussion of what we define as functionally referential, and of what the different animal species take as functionally referential. Until we understand how animals categorise their ecological or social environment, it will be difficult for us humans to distinguish calls that are more accurately functionally referential from others. One productive way forward is to design specific experiments to reveal how animals categorise their environment and what kind of concepts they use.
Acknowledgement

We thank T. H. Clutton-Brock for the opportunity to collect the data at his study site of the Cambridge Meerkat Project at Van Zyl's Rus. We are grateful to Mr and Mrs H. Koetzee for permission to work on their farm. The members of the Mammal Research Institute, University of Pretoria, including J. Du Toit and M. Haupt, were a great help with logistics. Many thanks to all the assistants and volunteers working on the farm at the Meerkat Project. We are grateful to B. König and K. McComb for discussions, and to A. McElligot for improvements to the manuscript. The research was funded by a grant to MM from the Swiss Nationalfonds, Nr. 823A–53475, and was conducted under the ethical laws of the University of Pretoria.
References

Abry, C., & Schwartz, J.-L. (2004). Language origins: A puzzle with many pieces. In J.-L. Schwartz (Ed.), Special issue of Primatologie, 6, 355–368.
Beynon, P., & Rasa, O. A. E. (1989). Do dwarf mongoose have a language? Warning vocalizations transmit complex information. South African Journal of Science, 85, 447–450.
Clutton-Brock, T. H., Gaynor, D., Kansky, R., MacColl, A. D. C., McIlrath, G., Chadwick, P., Brotherton, P. N. M., O'Riain, J. M., Manser, M., & Skinner, J. D. (1998). Costs of cooperative behaviour in suricates (Suricata suricatta). Proc. R. Soc. Lond. B, 265, 185–190.
Crockford, C., & Boesch, C. (2003). Context specific calls in wild chimpanzees, Pan troglodytes verus: Analysis of barks. Animal Behaviour, 66, 115–125.
Di Bitetti, M. S. (2003). Food associated calls of tufted capuchin monkeys (Cebus apella nigritus) are functionally referential signals. Behaviour, 140, 565–592.
Evans, C. S., Evans, L., & Marler, P. (1993). On the meaning of alarm calls: Functional reference in an avian vocal system. Animal Behaviour, 46, 23–28.
Evans, C. S., & Evans, L. (1999). Chicken food calls are functionally referential. Animal Behaviour, 58, 307–319.
Fischer, J. (1998). Barbary macaques categorize shrill barks into two call types. Animal Behaviour, 55, 799–807.
Hauser, M. D., & Marler, P. (1993). Food associated calls in rhesus macaques (Macaca mulatta): I. Socioecological factors influencing call production. Behavioral Ecology, 4, 194–205.
Hauser, M. D. (1996). The evolution of communication. Cambridge, MA: MIT Press.
Macedonia, J. M., & Evans, C. S. (1993). Variation among mammalian alarm call systems and the problem of meaning in animal signals. Ethology, 93, 177–197.
Manser, M. B. (1998). The evolution of auditory communication in suricates (Suricata suricatta). Ph.D. thesis, Cambridge University.
Manser, M. B. (2001). The acoustic structure of suricates' alarm calls varies with predator type and the level of response urgency. Proc. R. Soc. Lond. B, 268, 2315–2324.
Manser, M. B., Bell, M. B., & Fletcher, L. (2001). The information that receivers extract from alarm calls in suricates. Proc. R. Soc. Lond. B, 268, 2485–2491.
Seyfarth, R. M., Cheney, D. L., & Marler, P. (1980). Vervet monkey alarm calls: Semantic communication in a free-ranging primate. Animal Behaviour, 28, 1070–1094.
Slobodchikoff, C. N., Fischer, C., & Shapiro, J. (1986). Predator-specific alarm calls of prairie dogs. American Zoologist, 26, 557–571.
Sokal, R. R., & Rohlf, F. J. (1995). Biometry (3rd ed.). New York: W. H. Freeman & Co.
Tomasello, M., Call, J., & Hare, B. (1998). Five primate species follow the visual gaze of conspecifics. Animal Behaviour, 55, 1063–1069.
Zuberbuehler, K. (2000). Referential labelling in Diana monkeys. Animal Behaviour, 59, 917–927.
About the authors

Marta Manser took her PhD at the Zoological Institute, Cambridge University, UK, on acoustic communication in meerkats. She then moved to a postdoc in the Department of Psychology, University of Pennsylvania, USA, to continue the work on cognition and communication in meerkats. She presently leads a research group continuing this work on meerkats and other social mammals at the Verhaltensbiologie, Zoologisches Institut, University of Zurich, Switzerland. She is also co-director of the Kalahari Meerkat Research Project.

Lindsay Fletcher completed her undergraduate studies in biology at the University of Pennsylvania, USA. She then spent half a year on the Kalahari Meerkat Project, collecting field data and performing experiments on the meerkats. She is presently pursuing a PhD in Clinical Psychology at the University of Nevada, Reno, USA, on combining elements of mindfulness from Eastern meditation practices with behavioural psychology.
Mirror neurons, gestures and language evolution Leonardo Fogassi and Pier Francesco Ferrari Dipartimento di Neuroscienze, Università di Parma / Dipartimento di Psicologia, Università di Parma
Different theories have been proposed to explain the evolution of language. One of these maintains that gestural communication was the precursor of human speech. Here we present a series of neurophysiological findings that support this hypothesis. Communication by gestures, defined as the capacity to emit and recognize meaningful actions, may have originated in the monkey motor cortex from a neural system whose basic function was action understanding. This system comprises neurons of the monkey's area F5, named mirror neurons, which are activated by both the execution and the observation of goal-related actions. Recently, two new categories of mirror neurons have been described: neurons of one category respond to the sound of an action; neurons of the other respond to the observation of ingestive and communicative mouth actions. The properties of these neurons indicate that monkey area F5 possesses the basic neural mechanisms for associating gestures and meaningful sounds, as a pre-adaptation for the later emergence of articulated speech. The homology and the functional similarities between monkey area F5 and Broca's area support this evolutionary scenario. Keywords: mirror neurons, monkey, language evolution, gesture, Broca's area
Introductory remarks

In this article we will provide neurophysiological evidence and hypotheses on the possible evolution of language from a neural motor system involved in action recognition. There are different theories about language evolution. Some authors claim that language evolved from monkey vocalization (Cheney & Seyfarth, 1990; MacLean, 1993; Hauser, 1996; Ghazanfar & Hauser, 1999), others from gestural communication (Corballis, 1992, 2002, 2003; Armstrong et al., 1995; Rizzolatti & Arbib, 1998).
Others even deny a direct evolution of language from any monkey precursor, considering language a completely new acquisition of the human lineage, with characteristics that differ entirely from those of any other animal cognitive function (Chomsky, 1986). In the next sections we will examine the possible mechanisms and processes that might have been involved in the evolution of a vocal communicative system as complex as human speech. We will attempt to draw an evolutionary scenario in which gestures and speech emerged, supported by specific neurophysiological mechanisms.
Vocal communication in primates and its limitations for language evolution

In the past few decades, several studies have compared the vocalizations of nonhuman primates with human speech (Ghazanfar & Hauser, 1999) in order to find the possible common ground from which human speech could have evolved. These investigations aimed to identify commonalities between the two forms of vocalization. One of the main features of monkey vocal communication that parallels human language is its potential to be referential. Referential communication has a representational value, in that a vocalization is used as a symbol referring to objects or events external to the signaler. The work by Cheney and Seyfarth (1982) showed how different alarm calls in vervet monkeys provide listening conspecifics with information about the type of predator approaching. Data on the referential aspects of primate vocalization have accumulated over the last decade (see Ghazanfar & Hauser, 1999, for a review). Despite some encouraging data, however, several reasons argue against nonhuman primate vocal calls as precursors of human speech (see also Hauser et al., 2002). First, vocal calls are largely related to an emotional state. This observation is also supported by neuroethological studies showing that monkey vocalization is under the neural control of the primitive mesial cingulate circuit, which is mainly involved in emotional behavior (Jürgens, 1995, 2002). Even in our closest relative, the chimpanzee, it has been suggested that vocal calls cannot be produced without the appropriate emotional state (Goodall, 1986). Thus, the voluntary control of vocal production, so typical of human speech, is unlikely to be present in nonhuman primates. These findings accord with the observation that there is very little evidence of intentional vocal communication in nonhuman primates (Corballis, 2003). Second, monkeys' vocal calls show very little flexibility in terms of creating new sounds beyond those already present in the vocal repertoire. Lastly, the number of vocal signals used for referential communication is very small, mainly restricted to objects or present events (Hauser et al., 2002).
Gestural communication as a suitable substrate for language evolution

Before introducing a hypothesis on language evolution from gestures, we need to clarify that the term gesture is used here and in the following sections to indicate both goal-related actions (e.g. grasping an object with the hand) and communicative oro-facial and brachio-manual movements. A number of old and new comparative anatomical and neurophysiological data support a gradual evolution of language from gestural communication (Paget, 1963; Corballis, 1992, 2000, 2002; Kimura, 1993) rather than from vocalization. The evolution of facial communicative gestures in nonhuman primates was extensively investigated in the pioneering studies of van Hooff (1962, 1967), who compared the functional role and meaning of each facial expression across several species. Only more recently has communication through brachio-manual gestures in apes been documented (de Waal, 1982; Tomasello et al., 1997; Tanner & Byrne, 1996). Interestingly, it has been shown that chimpanzees make extensive use of brachio-manual gestures to communicate with each other. Overall, studies of gestural communication in monkeys and apes have revealed that the production of most gestures, especially oro-facial ones, is probably not only a hard-wired phenomenon, but can also result from processes of individual and social learning (Parr & Maestripieri, 2003). This latter aspect suggests that nonhuman primate gestures might reflect not only emotional states but also the cognitive processes underlying their production.

Why, then, does gestural communication seem more suitable as a substrate for building up a human-like communicative system? Several lines of research support this hypothesis. First, differently from vocal communication, gestures are often used for dyadic social interactions. Second, it has been shown that gestural communication involving brachio-manual actions is very flexible and combinatorially richer than vocal communication (Corballis, 2003). Studies on chimpanzees have shown that the same brachio-manual gesture can be used for different purposes, and that different gestures can be used to indicate the same goal (see Parr & Maestripieri, 2003, for a review). Third, gestural communication can convey a relational content less endowed with strong emotional involvement; this content is expressed both with brachio-manual and with oro-facial gestures (see van Hooff, 1962, 1967; Parr & Maestripieri, 2003). Fourth, gestures can be intentional, as Plooij (1978, 1984) first described in chimpanzees. Further studies by Tomasello and colleagues (1985, 1989) identified several intentional gestures in chimpanzees used, for example, to gather another individual's attention and to communicate information about impending behavior.
How the brain codes gestures

Communication between two individuals requires two basic capacities, strictly linked to each other: the capacity to emit a gesture and the capacity to recognize it. These capacities must rely on neural circuits, at least part of which must belong to the cerebral cortex, because of its pivotal role in voluntary behavior. It is well known that in humans the main cortical areas involved in language production and perception are located in the frontal and temporal cortices. Although in monkeys we cannot speak of a language function, we can nevertheless look for neural circuits involved in gesture production and recognition. To do so, we must consider that gestures belong to the domain of meaningful actions; in particular, communicative gestures, although often devoid of a target object, are strongly endowed with meaning. We will first try to identify the monkey brain areas linked to action production and perception. Subsequently we will propose, on the basis of recent findings, that the production and perception of communicative actions may derive from this primitive system for action execution and perception.
Action vs. movement

To use the word action when referring to the neural circuits of the cerebral cortex is by no means obvious. This is because of the assumption, held until not long ago and still considered valid by some researchers, that the main aim of the sector of frontal cortex devoted to motor control is to code movements, that is, joint displacements. According to this view, the main scientific problem was, for many years, to establish whether motor cortex neurons could, at the single-neuron or population level, code parameters of movement such as force, amplitude and direction. A different view maintains that the main function of the motor cortex is to code goal-directed actions (see Rizzolatti et al., 2000). This type of code makes the task of the motor cortex easier, since it consists in defining a limited number of motor goals rather than an exponential number of movements. Once goals are coded, they can be implemented by executive areas. The neurophysiological and neuroanatomical data of the last twenty years show that this latter view actually applies to many areas of the motor cortex. Actions are coded in several areas of the premotor cortex and, more extensively, in the parieto-frontal circuits linking specific premotor and parietal areas (see Rizzolatti et al., 1988, 1998; Rizzolatti & Luppino, 2001). The coding of goal-directed actions at the single-neuron level has been studied in most detail in the ventral
premotor cortex of the macaque monkey, namely in areas F4 and F5. Microstimulation and neuron recording in area F4 show that this area is mainly involved in axial and proximal actions towards spatial targets (Gentilucci et al., 1988; Fogassi et al., 1996). Microstimulation and recording experiments carried out in area F5 show that this area contains a partially overlapping representation of hand and mouth movements. Single-neuron recording experiments show that different types of hand and mouth actions are coded in F5, such as grasping, manipulating, holding and tearing. Within these categories of purely motor neurons there are also subcategories whose neurons code a more specific goal, such as the way in which an object is grasped (for example, precision grip or whole-hand prehension) (Gentilucci et al., 1988; Rizzolatti et al., 1988).
Mirror neurons and gestural communication

In area F5 there are also visuomotor neurons that become active not only when the monkey performs hand actions, but also when it observes another individual making similar actions (Gallese et al., 1996; Rizzolatti et al., 1996; see also Fogassi & Gallese, 2002). These neurons have been named "mirror neurons". They respond neither to simple object presentation nor to the vision of a hand miming an action without the target object; nor do they respond to actions mimicked with tools, such as pliers. Thus, the optimal visual stimulus that triggers a mirror neuron response is a hand-object interaction, that is, the observation of a goal-directed action. A further demonstration that these neurons code the goal of the action, and not simply a pictorial representation of biological motion, was given by a study in which the monkey could see only the first part of the hand action, the final, crucial part being performed behind a screen (Umiltà et al., 2001). Half of the mirror neurons tested in this 'hidden' condition responded even during the period in which the hand-object interaction was not visible. These neurons seem able to keep the presence of the object in memory and to internally reconstruct the unseen part of the action.

The most interesting property of mirror neurons is that in most of them there is a good congruence between the observed and the executed actions effective in activating them. Because of this congruence it has been hypothesized that mirror neurons, by matching action observation with action execution, allow the understanding of actions made by others. This capacity is not simply limited to the recognition of motor patterns; it extends to the goal of the observed action. Every time we observe an action made by another individual, we are able to understand its goal because the observed action is matched onto our internal
representation of it, which, in turn, is endowed with knowledge of the goal. The presence of this matching system is strongly suggestive of a possible role in inter-individual communication, granting that the latter was originally based on gestures. In fact, a communicative gesture made by an actor (the sender) retrieves in the observer (the receiver) the neural circuit encoding the motor representation of the same gesture. This allows the receiver to understand the message of the sender and, perhaps, to initiate a response (see Rizzolatti & Arbib, 1998). The question now arises whether there are data indicating the possibility of the above-mentioned link between the action observation/execution system and the communication system. Two recently discovered categories of mirror neurons in the monkey could explain the transition from a basic action-understanding neural system to a system endowed with features typical of language: audio-visual mirror neurons and mouth mirror neurons (a toy sketch of this shared, modality-independent coding is given at the end of this section).

– Audio-visual mirror neurons become active when monkeys not only observe but also hear the sound of an action (Kohler et al., 2002). The responses of these neurons are specific for the type of action seen and heard. For example, they respond to peanut breaking when the action is only observed, only heard, or both heard and observed, and they do not respond to the vision and sound of another action, or to unspecific sounds (see Figure 1). Note that the neuron discharge to the simultaneous presentation of both the visual and the acoustic input is often higher than the response to either input presented alone. These data have two important implications: (a) the acoustic input reaching the motor cortex of a listener allows him to retrieve the action representation present in this area, thus accessing the action's meaning; this is probably the process occurring when listening to spoken language; (b) audio-visual mirror neurons code actions independently of whether these actions are seen, heard or performed. The capacity to represent action content independently of the modality used to access this content is typical of language.
Figure 1. Example of a mirror neuron responding to the sound of an action. Rastergrams are shown together with spike density functions. The text above each rastergram describes the sound or action used to test the neuron. Vertical lines indicate the time at which the sound occurred. V + S: vision-and-sound condition; the monkey observes and hears the experimenter ripping a sheet of paper. S: sound condition; the monkey only hears the experimenter ripping a sheet of paper. CS1 and CS2: control sounds. Traces under the spike density functions in the S and CS conditions are oscillograms of the sounds used to test the neuron (modified from Kohler et al., 2002).
of mouth mirror neurons respond specifically to the observation of mouth communicative actions belonging to the monkey repertoire, such as lips-smacking, lips protrusion or tongue protrusion. Mouth mirror neurons of this sub-category do not respond, or respond very weakly, to the observation of ingestive actions. Figure 3b and c show two examples of ‘communicative’ mouth mirror neurons. Note that all these actions have an affiliative meaning. Mouth communicative mirror neurons with a response to observation of threatening or aggressive gestures were never found. The motor response of mouth communicative mirror neurons is more complex. In those neurons in which it was possible to test the motor response during monkey execution of communicative actions, there was a clear activation. However, most neurons responded also when the monkey executed ingestive actions. Thus there is an apparent incongruence between the visual and the motor response or, at least, the motor response seems more unspecific than the visual one. However, when compared with the effective observed communicative actions,
Figure 2. Examples of ingestive and communicative actions performed by the experimenter in front of the recorded monkey (left column) and the same gestures made by the monkey (right column). From top to bottom: grasping of a piece of food; lips-protruded face. (Modified from Ferrari et al., 2003)
the effective executed ingestive actions appeared motorically similar to the observed ones. For example, a neuron responding to the observation of lips-smacking responded also to execution of a sucking action. In both cases an alternation between lips protrusion and retraction and between jaw opening and closure is observed. These visuo-motor properties of communicative mouth mirror neurons suggest that during evolution part of the motor structures of ingestive behavior could have been exploited in order to be used for communicative behavior. This transition, probably ‘photographed’ in the properties of communicative mouth mirror neurons, occurs in a region (ventral premotor cortex) endowed with the apparatus controlling the execution of ingestive behavior (see Ferrari et al., 2003). The hypothesized evolution of the communicative lateral system from an ingestive motor repertoire seems witnessed by ethological studies conducted in several non-human primates. These studies suggest that some communicative gestures such as lips-smacking and pucker face very likely evolved from movements aimed to remove and eat particles, such as skin parasites, from the fur of group mates during grooming sessions. This suggestion is corroborated by the observation that
Figure 3. Examples of mouth mirror neurons. In each panel the rasters and histograms represent the neuron's response in a single experimental condition; each histogram is the average of ten trials. Rasters and histograms are aligned with the moment at which the mouth or the hand of the experimenter (observation conditions) or the mouth of the monkey (motor conditions) touched the food, or at which the food was abruptly presented (presentation conditions). During the observation of communicative actions, rasters and histograms were aligned with the moment at which the action was fully expressed. Ordinates: spikes/sec; abscissae: time; bin width: 20 ms. A. Ingestive mouth mirror neuron. Top: the experimenter approaches with his mouth the food held on a support, grasps it with his teeth and holds it. Middle: the experimenter grasps with his hand a piece of food placed on a support and holds it. Bottom: the experimenter moves a piece of food to the monkey's mouth; the monkey grasps and holds it with its teeth. B. Communicative mouth mirror neuron. Top: the experimenter makes a lip-smacking action while looking at the monkey. Middle: the experimenter protrudes his lips while looking at the monkey. Bottom: the experimenter moves a piece of food towards the monkey's mouth; the monkey protrudes its lips and takes the food. C. Communicative mouth mirror neuron. Left: the experimenter protrudes his lips while looking at the monkey. Right: during the experimenter's lip protrusion the monkey responds almost simultaneously to the experimenter's gesture with a lip-smacking action. (Modified from Ferrari et al., 2003)
the beginning of a grooming session can be preceded or accompanied by a lip-smacking action without ingestion (see van Hooff, 1962, 1967; Maestripieri, 1996). Interestingly, this latter lip-smacking produces a more pronounced sound than that associated with ingestion, as if to underline a different meaning. Thus, lip-smacking appears to be a ritualization of an ingestive action that has lost its behavioral meaning related to feeding and acquired an affiliative one. The transformation of part of the ingestive system into a communicative system is also in agreement with the proposal by MacNeilage (1998) of an evolution, within the lateral cortical system, of the syllabic frame from the cyclic mandibular open-close alternation typical of food ingestion. In his view the precursors of human syllables are monkey lip-smacks and tongue-smacks. Further, although still preliminary, support for the appearance of a lateral communicative system in the monkey comes from mouth communicative mirror neurons responding to sounds. The example reported in Figure 4 shows a communicative mirror neuron that responds to the vision of a communicative gesture, such as lip protrusion, accompanied by a vocalization made by the experimenter. This neuron is also activated when the monkey only sees the same gesture or only hears the vocalization. The presence of mouth mirror neurons suggests that in the ventral premotor cortex the action understanding system has been exploited to evolve an oro-facial communicative system that, unlike the system controlled by the mesial cortex, is no longer linked to strong emotional behavior, but to affiliative behavior. These findings also favor the view that a gestural communicative system probably preceded the vocally based communicative system in human evolution.
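As promised above, here is a toy illustration of the modality-independent coding idea: the same action meaning is retrieved whether the input is visual, auditory or motor. The sketch is purely conceptual; all names and the lookup structure are ours, and it is not a model of area F5.

```python
from typing import Optional

# Toy lookup: one shared action code reachable from several input modalities.
ACTION_INDEX = {
    ("visual", "hand_rips_paper"): "ripping",
    ("auditory", "ripping_sound"): "ripping",
    ("motor", "rip_with_hands"): "ripping",
    ("visual", "lip_smacking"): "affiliative_contact",
    ("motor", "sucking_cycle"): "affiliative_contact",
}

def understood_action(modality: str, stimulus: str) -> Optional[str]:
    """Retrieve the shared action meaning regardless of input modality."""
    return ACTION_INDEX.get((modality, stimulus))

# The same meaning is retrieved whether the action is seen, heard or performed.
assert (understood_action("auditory", "ripping_sound")
        == understood_action("visual", "hand_rips_paper")
        == understood_action("motor", "rip_with_hands"))
```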
Homology between F5 and Broca's area

The discovery in area F5 of audio-visual mirror neurons and of mouth communicative mirror neurons is in good agreement with the proposed homology between F5 and Broca's area. This homology is based on several findings. (1) Cytoarchitectonically, both area 44 (part of Broca's area) and area F5 are dysgranular (see Petrides & Pandya, 1994). (2) Both Broca's area and F5 have a mouth and a hand representation. In fact, several brain imaging experiments have demonstrated that Broca's area, beyond its classical role in speech production, is also involved in hand movement tasks; for example, it is activated by the execution of hand movements, by mental imagery of grasping actions and by hand mental rotation tasks (Bonda et al., 1994; Parsons et al., 1995; Chollet et al., 1991; Grafton et al., 1996; Iacoboni et al., 1999). (3) Area F5 is endowed with a system for the control of laryngeal muscles and of oro-facial synergies (Hast et al., 1974). (4) Both area F5 and Broca's area are activated during the observation of hand and mouth actions. Many EEG, TMS, MEG and brain imaging experiments have demonstrated an activation of the inferior frontal cortex when subjects were required to observe goal-related actions made by another individual (see Rizzolatti et al., 2001, for references). Thus, the existence of a mirror system for action understanding is confirmed in humans as well. In particular, recent fMRI experiments have demonstrated that the inferior frontal gyrus is activated both when subjects observe a biting action (Buccino et al., 2001, 2004) and when they observe other individuals performing silent speech (Campbell et al., 2001; Buccino et al., 2004; Calvert & Campbell, 2003). In accordance with the idea of a gradual evolution of the human speech area from a monkey precursor, it is noteworthy that a left anatomical asymmetry of the ventral premotor area 44, similar to that present in humans, has recently been described in apes (Cantalupo & Hopkins, 2001).

Figure 4. Example of a communicative mouth mirror neuron responding to sound. During listening to the experimenter's vocalization, rasters and histograms are aligned with the moment at which the experimenter began to vocalize; during observation of the communicative action, they are aligned with the moment at which the action was fully expressed. Top: the experimenter protrudes his lips while looking at the monkey and emitting a vocalization; the vocalization was a prolonged and deep 'uuh' call, similar to the 'coo' calls of macaques. Middle: the experimenter protrudes his lips while looking at the monkey. Bottom: the experimenter emits the same vocalization as in the first condition while standing 1 m behind the monkey.
Gestures and sound association for the evolution of human speech

All these observations strongly suggest that the cortical precursor of Broca's area was initially endowed with the capacity to execute and understand hand and mouth actions, together with elements that could indicate primitive forms of dyadic communication. Although in the monkey we have evidence of only a limited number of neurons involved in oro-facial communicative actions, we cannot ignore the fact that these neurons are found in an area in which both oro-facial and brachio-manual gestures are motorically coded, and often strictly associated. Clearly, the brachio-manual gestures described in macaque premotor area F5 are not strictly communicative, and it is known that in macaques the brachio-manual gestural repertoire is not particularly rich. However, on the basis of the visuomotor properties of F5 neurons described above, it is reasonable to hypothesize that, when a primitive use of brachio-manual gestures for communication emerged, area F5 was already endowed with the features necessary for controlling this type of communication (i.e., voluntary motor control of hand actions and the hand mirror neuron system) and for integrating it with the oro-facial visuomotor system.

However, an important gap to be filled in the route leading to human language is the incorporation of sound production into the gestural communicative system. We propose that an early step in this process consisted in the casual production of a sound during the expression of a communicative gesture. Individuals displaying this casual association would have benefited from it, because the sound can give more salience to the gesture, thereby improving the transfer of information to other
group members. Beyond salience, the possibility of using several oro-facial and brachio-manual gestures in association with utterances represents a great advantage, because it allows them to be combined in a flexible manner, thus creating a richer vocabulary. In this respect monkey area F5 could have had some evolutionary advantage, since it is endowed with a large population of neurons controlling both mouth and hand actions. Looking at our closest relatives, the chimpanzees, it is clear that in this species there is a new achievement in the communicative brachio-manual repertoire (see Parr & Maestripieri, 2003), paralleled by more complex control of the vocal communicative system (Arcadi, 1996, 2000). However, the association between gesture and vocalization remains largely unexplored in apes.

In parallel with the evolution of a more complex motor output, as proposed above, perceptual representations also became more sophisticated. In fact, a newly acquired combination of gesture and sound necessarily produces a new motor representation. At this point the mirror neuron matching mechanism should come into play, allowing the meaning of the new gesture-sound combination to be understood. This process would not start from scratch, but would originate from an already established gesture recognition system. This system, for example, allows a monkey to recognize that another monkey is performing an affiliative gesture (e.g. a pout face). If the agent produces a sound together with the pout face, a perceptual association between gesture and sound can easily be created in the brain of the observer, provided the two stimuli are repeatedly associated. This 'new' gesture would be endowed with the meaning previously provided by the observation of the gesture alone. This capacity to extend the meaning of a gesture to new gestures through a process of generalization has recently been discovered in a new category of F5 mirror neurons (Ferrari et al., 2005). The demands of an efficient communicative system rest not only on the increased complexity of the motor output, but also on the parallel development of more sophisticated perceptual representations. A parsimonious cortical system achieving this correspondence is well exemplified by the F5 mirror system, in that it generates motor representations of hand and mouth actions that can be used for acting, perceiving and understanding.

Once a primitive communicative system based on an association between gestures and vocalization was in place, a further step in both the motor and the sensory development of this system probably occurred through the acquisition of a more sophisticated phonatory mechanism, which allowed the association of a gesture with a specific sound. At this stage of language evolution, the possibility of creating a theoretically infinite set of combinations rendered the phonatory system alone more efficient than the previous vocal-gestural system. This stage was crucial for the development of a speech-based communicative system (see Rizzolatti & Arbib, 1998).
Although it is obviously very hard to establish when the transition to a largely independent phonatory system occurred, it is noteworthy that even in modern humans spoken language is commonly accompanied by gestures that enrich the communicative message (McNeill, 1992; Goldin-Meadow, 1999). This favors the idea that in humans the control of speech and gestures may partially overlap in the same cortical areas (e.g., Broca's area). Strong evidence of this overlap is given by brain imaging studies showing an activation of the inferior frontal cortex in deaf people during the production of meaningful signs (Petitto et al., 2000). In conclusion, we have drawn a hypothetical evolutionary pathway from monkey communication towards human language, starting from neurophysiological data. The tenet of this hypothesis is that the mirror system, although it evolved originally within a non-communicative system as a mechanism for action understanding, enhanced its capacity to convey information by combining brachio-manual and oro-facial gestures with vocalization. Monkey area F5 appears to be the strongest candidate for the convergence of these functions in the same brain region, also in the light of its anatomo-functional homology with human Broca's area. The crucial step in this evolutionary scenario would have been a reorganization of the mirror system for the intentional control of gestures and vocalization.
Acknowledgement This work was supported by MIUR, Italian CNR, French CNRS and ESF Programme Eurocores (OMLL, 01‑FOS).
References

Arcadi, A. C. (1996). Phrase structure of wild chimpanzee pant hoots: Patterns of production and interpopulation variability. American Journal of Primatology, 39, 159–178.
Arcadi, A. C. (2000). Vocal responsiveness in male wild chimpanzees: Implications for the evolution of language. Journal of Human Evolution, 39, 205–223.
Armstrong, D. F., Stokoe, W. C., & Wilcox, S. E. (1995). Gesture and the nature of language. Cambridge: Cambridge University Press.
Bonda, E., Petrides, M., Frey, S., & Evans, A. C. (1994). Frontal cortex involvement in organized sequences of hand movements: Evidence from a positron emission tomography study. Society of Neuroscience Abstracts, 20, 353.
Buccino, G., Binkofski, F., Fink, G. R., Fadiga, L., Fogassi, L., Gallese, V., Seitz, R. J., Zilles, K., Rizzolatti, G., & Freund, H. J. (2001). Action observation activates premotor and parietal areas in a somatotopic manner: An fMRI study. European Journal of Neuroscience, 13, 400–404.
Buccino, G., Lui, F., Canessa, N., Patteri, I., Lagravinese, G., Benuzzi, F., Porro, C. A., & Rizzolatti, G. (2004). Neural circuits involved in the recognition of actions performed by non-conspecifics: An fMRI study. Journal of Cognitive Neuroscience.
Calvert, G. A., & Campbell, R. (2003). Reading speech from still and moving faces: Neural substrates of visible speech. Journal of Cognitive Neuroscience, 15, 57–70.
Campbell, R., MacSweeney, M., Surguladze, S., Calvert, G. A., McGuire, P., Suckling, J., Brammer, M. J., & David, A. S. (2001). Cortical substrates for the perception of face actions: An fMRI study of the specificity of activation for seen speech and for meaningless lower-face acts (gurning). Cognitive Brain Research, 12, 233–243.
Cantalupo, C., & Hopkins, W. D. (2001). Asymmetric Broca's area in great apes. Nature, 414, 505.
Cheney, D. L., & Seyfarth, R. M. (1982). How vervet monkeys perceive their grunts. Animal Behaviour, 30, 739–751.
Cheney, D. L., & Seyfarth, R. M. (1990). How monkeys see the world: Inside the mind of another species. Chicago: University of Chicago Press.
Chollet, F., DiPiero, V., Wise, R. J., Brooks, D. J., Dolan, R. J., & Frackowiak, R. S. (1991). The functional anatomy of motor recovery after stroke in humans: A study with positron emission tomography. Annals of Neurology, 29, 63–71.
Chomsky, N. (1986). Knowledge of language: Its nature, origin and use. New York: Praeger.
Corballis, M. C. (1992). On the evolution of language and generativity. Cognition, 44, 197–226.
Corballis, M. C. (2002). From hand to mouth: The origins of language. Princeton, NJ: Princeton University Press.
Corballis, M. C. (2003). From mouth to hand: Gesture, speech, and the evolution of right-handedness. Behavioral and Brain Sciences, 26, 199–260.
De Waal, F. B. M. (1982). Chimpanzee politics. New York: Harper & Row.
Ferrari, P. F., Gallese, V., Rizzolatti, G., & Fogassi, L. (2003). Mirror neurons responding to the observation of ingestive and communicative mouth actions in the monkey ventral premotor cortex. European Journal of Neuroscience, 17, 1703–1714.
Ferrari, P. F., Rozzi, S., & Fogassi, L. (2005). Mirror neurons responding to the observation of actions made with tools in monkey ventral premotor cortex. Journal of Cognitive Neuroscience, 17(2), 212–226.
Fogassi, L., Gallese, V., Fadiga, L., & Rizzolatti, G. (1996). Space coding in inferior premotor cortex (area F4): Facts and speculations. In F. Lacquaniti & P. Viviani (Eds.), Neural basis of motor behaviour (NATO ASI Series, pp. 99–120). Dordrecht: Kluwer Academic Publishers.
Fogassi, L., & Gallese, V. (2002). The neural correlates of action understanding in non-human primates. In M. I. Stamenov & V. Gallese (Eds.), Mirror neurons and the evolution of brain and language (pp. 13–35). Amsterdam: John Benjamins.
Gallese, V., Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119, 593–609.
Gentilucci, M., Fogassi, L., Luppino, G., Matelli, M., Camarda, R., & Rizzolatti, G. (1988). Functional organization of inferior area 6 in the macaque monkey: I. Somatotopy and the control of proximal movements. Experimental Brain Research, 71, 475–490.
Ghazanfar, A. A., & Hauser, M. D. (1999). The neuroethology of primate vocal communication: Substrates for the evolution of speech. Trends in Cognitive Sciences, 3, 377–384.
Goldin-Meadow, S. (1999). The role of gestures in communication and thinking. Trends in Cognitive Sciences, 3, 419–429.
Goodall, J. (1986). The chimpanzees of Gombe: Patterns of behavior. Cambridge, MA: Harvard University Press.
Grafton, S. T., Arbib, M. A., Fadiga, L., & Rizzolatti, G. (1996). Localization of grasp representations in humans by positron emission tomography: 2. Observation compared with imagination. Experimental Brain Research, 112, 103–111.
Hast, M. H., Fischer, J. M., Wetzel, A. B., & Thompson, V. E. (1974). Cortical motor representation of the laryngeal muscles in Macaca mulatta. Brain Research, 73, 229–240.
Hauser, M. D. (1996). The evolution of communication. Cambridge, MA: MIT Press.
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298, 1569–1579.
Iacoboni, M., Woods, R. P., Brass, M., Bekkering, H., Mazziotta, J. C., & Rizzolatti, G. (1999). Cortical mechanisms of human imitation. Science, 286, 2526–2528.
Jürgens, U. (1995). Neuronal control of vocal production in human and nonhuman primates. In E. Zimmermann, J. D. Newman, & U. Jürgens (Eds.), Current topics in primate vocal communication (pp. 199–206). New York: Plenum Press.
Jürgens, U. (2002). Neural pathways underlying vocal control. Neuroscience and Biobehavioral Reviews, 26, 235–258.
Kimura, D. (1993). Neuromotor mechanisms in human communication. Oxford: Oxford University Press.
Kohler, E., Keysers, C., Umiltà, M. A., Fogassi, L., Gallese, V., & Rizzolatti, G. (2002). Hearing sounds, understanding actions: Action representation in mirror neurons. Science, 297, 846–848.
MacLean, P. D. (1993). In B. A. Vogt & M. Gabriel (Eds.), Neurobiology of cingulate cortex and limbic thalamus: A comprehensive handbook (pp. 1–15). Birkhäuser.
MacNeilage, P. F. (1998). The frame/content theory of evolution of speech production. Behavioral and Brain Sciences, 21, 499–546.
Maestripieri, D. (1996). Gestural communication and its cognitive implications in pigtail macaques (Macaca nemestrina). Behaviour, 133, 997–1022.
McNeill, D. (1992). Hand and mind. Chicago: University of Chicago Press.
Paget, R. A. S. (1963). Human speech: Some observations, experiments and conclusions as to the nature, origin, purpose and possible improvement of human speech. London: Routledge & Kegan Paul.
Parr, L. A., & Maestripieri, D. (2003). Nonvocal communication. In D. Maestripieri (Ed.), Primate psychology (pp. 324–358). Cambridge, MA: Harvard University Press.
Parsons, L. M., Fox, P. T., Downs, J. H., Glass, T., Hirsch, T. B., Martin, C. C., Jerabek, P. A., & Lancaster, J. L. (1995). Use of implicit motor imagery for visual shape discrimination as revealed by PET. Nature, 375, 54–58.
Petitto, L. A., Zatorre, R. J., Gauna, K., Nikelski, E. J., Dostie, D., & Evans, A. C. (2000). Speech-like cerebral activity in profoundly deaf people processing signed languages: Implications for the neural basis of human language. Proceedings of the National Academy of Sciences of the USA, 97, 13961–13966.
Petrides, M., & Pandya, D. N. (1994). Comparative architectonic analysis of the human and the macaque frontal cortex. In F. Boller & J. Grafman (Eds.), Handbook of neuropsychology (pp. 17–58). Amsterdam: Elsevier.
Plooij, F. X. (1978). Some basic traits of language in wild chimpanzees? In A. Lock (Ed.), Before speech: The beginning of interpersonal communication (pp. 223–243). Cambridge: Cambridge University Press.
Plooij, F. X. (1984). The behavioral development of free-living chimpanzee babies and infants. Norwood, NJ: Ablex.
Rizzolatti, G., & Arbib, M. A. (1998). Language within our grasp. Trends in Neurosciences, 21, 188–194.
Rizzolatti, G., & Luppino, G. (2001). The cortical motor system. Neuron, 31, 889–901.
Rizzolatti, G., Camarda, R., Fogassi, L., Gentilucci, M., Luppino, G., & Matelli, M. (1988). Functional organization of inferior area 6 in the macaque monkey: II. Area F5 and the control of distal movements. Experimental Brain Research, 71, 491–507.
Rizzolatti, G., Fadiga, L., Gallese, V., & Fogassi, L. (1996). Premotor cortex and the recognition of motor actions. Cognitive Brain Research, 3, 131–141.
Rizzolatti, G., Fogassi, L., & Gallese, V. (2000). Cortical mechanisms subserving object grasping and action recognition: A new view on the cortical motor function. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (2nd ed., pp. 539–552). Cambridge, MA: MIT Press.
Rizzolatti, G., Fogassi, L., & Gallese, V. (2001). Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews Neuroscience, 2, 661–670.
Tanner, J. E., & Byrne, R. W. (1996). Representation of action through iconic gesture in a captive lowland gorilla. Current Anthropology, 37, 162–173.
Tomasello, M., George, B. L., Kruger, A. C., Farrar, M. J., & Evans, A. (1985). The development of gestural communication in young chimpanzees. Journal of Human Evolution, 14, 175–186.
Tomasello, M., Gust, D., & Frost, G. T. (1989). A longitudinal investigation of gestural communication in young chimpanzees. Primates, 30, 35–50.
Tomasello, M., Call, J., Warren, J., Frost, G. T., Carpenter, M., & Nagell, K. (1997). The ontogeny of chimpanzee gestural signals: A comparison across groups and generations. Evolution of Communication, 1, 223–259.
Umiltà, M. A., Kohler, E., Gallese, V., Fogassi, L., Fadiga, L., Keysers, C., & Rizzolatti, G. (2001). I know what you are doing: A neurophysiological study. Neuron, 31, 155–165.
Van Hooff, J. A. R. A. M. (1962). Facial expressions in higher primates. Symposium of the Zoological Society of London, 97–125.
Van Hooff, J. A. R. A. M. (1967). The facial displays of the catarrhine monkeys and apes. In D. Morris (Ed.), Primate ethology (pp. 7–68). London: Weidenfeld & Nicolson.
About the authors

Leonardo Fogassi is Associate Professor of Neuroscience at the Department of Psychology, University of Parma, and carries out his research at the Department of Neuroscience of the University of Parma. Trained first as a biologist and then as a neurophysiologist, he received a PhD in Neuroscience from the University of Parma (Italy), during which he also spent one year in the neurophysiology laboratory of R. Andersen at M.I.T., Cambridge, MA. He has published many papers on the neurophysiological properties of the monkey premotor cortex, concentrating on the neuronal bases of visuomotor transformations for action and on the cognitive aspects of the motor and parietal cortex.

Pier Francesco Ferrari is a research fellow at the Department of Neuroscience, University of Parma, Italy. He is a biologist and received a PhD in Ethology in 1997 from the University of Parma (Italy) before moving to the United States as a post-doc at Tufts University in Boston. He has taught Neuroscience in the School of Psychology at the University of Parma. He has published on the behavioral, cognitive and neurophysiological bases of social behavior in rodents and macaque monkeys.
Lateralization of communicative signals in nonhuman primates and the hypothesis of the gestural origin of language

Jacques Vauclair
Research Center in Psychology of Cognition, Language and Emotion, University of Provence
This article argues for the gestural origins of speech and language on the basis of the available evidence gathered in humans and nonhuman primates, especially apes. The strong link between motor functions (hand use and manual gestures) and speech in humans is reviewed. The presence of asymmetrical cerebral organization in nonhuman primates, along with functional asymmetries in the perception and production of vocalizations and in intentional referential gestural communication, is then emphasized. The nature of primate communicatory systems is presented, and the similarities and differences between these systems and human speech are discussed. It is argued that recent findings concerning neuroanatomical asymmetries in the chimpanzee brain, together with the existence of both mirror neurons and lateralized use of hands and vocalizations in communication, necessitate a reconsideration of the phylogenetic emergence of the cerebral and behavioral prerequisites for human speech.

Keywords: evolution, communication, primates, gesture, language, vocalization, mirror neurons
The gestural hypothesis of speech origin and animal models

The idea that nonhuman primates are as efficient in producing gestures as in vocalizing was proposed long ago. In 1661, a British gentleman, Samuel Pepys, wrote in his diary about an animal he called a "baboone," which was more likely a chimpanzee: "I do believe it already understands much English; and I am of the mind it might be taught to speak or make signs" (cited by Wallman, 1992, p. 11). In more recent times, Hewes (1973) championed the view that gestural communication played a crucial role in human language evolution. He suggested that very early in
the history of the human species, gestures were under voluntary control and thus became an easy way to communicate long before the emergence of speech. Several researchers have since endorsed this view (e.g., Armstrong, Stokoe, & Wilcox, 1995; Kendon, 1991, 1993; Kimura, 1993). The thesis is also central to the propositions made by Corballis (1989, 1991, 2003), for whom manual gesturing was the mediating factor in the evolution of handedness and speech in humans. Corballis has emphasized that during the course of evolution, the left cerebral hemisphere acquired a general capacity for "generativity." As this capacity is understood to be one of the hallmarks of language, it would serve as a common substratum for image generation as well as an organizer of actions and pre-adaptations for speech production. Left hemispheric control of gestural acts would thus represent a feature much older than speech. Moreover, such control by the left hemisphere is viewed as the origin of the left cerebral lateralization for language in humans. According to a slightly different view, vocal as well as gestural communication would imply a sequential and temporal organization of movements (Bradshaw, 1988). Evolutionary pressures could thus have favored both functions, in relation to the control of gestures and speech, within the same (left) cerebral hemisphere. Note that this latter view presents the advantage of tracing an evolutionary path from animal to human communication without referring to animal vocalizations. This article aims to show that several features of the brain and of the communicatory behaviors (gestural and vocal) of nonhuman primates, especially apes, can provide useful clues for discussing the issue of the origins of speech and language. Since the question of gestures and speech and their cerebral control is central to the debate concerning the origins of language, I will first summarize the current state of knowledge in humans in this area. I will then introduce the question of cerebral and functional asymmetries in nonhuman primates, with an emphasis on the perception and production of vocal and gestural signals in cases of spontaneous and induced communication. I will then review other kinds of evidence which support the gestural origin hypothesis, namely the existence of mirror neurons in the monkey brain, and the characteristics of primate communication compared with those of human communication and language. Finally, I will pinpoint the implications and limitations of the primate model for discussing the question of the origin of speech, language and handedness.
Gestures, speech and hemispheric control in adults, children, infants and fetuses

Studies carried out with deaf people are useful to show the close relation between gestures and the left hemisphere. Firstly, it has been demonstrated that similar
areas within the left hemisphere are involved in the comprehension and in the production of signs by deaf people (Corina, Vaid, & Bellugi, 1992; Grossi et al., 1996). Moreover, the acquisition of sign language and that of speech present strong similarities in human infants, exemplified by the presence of "silent babbling" among hearing infants born to profoundly deaf parents during the course of their acquisition of natural signed languages (Petitto et al., 2001). Furthermore, Holowka and Petitto (2002) discovered that babies babble with a greater mouth opening on the right side of their mouths, indicating left-hemisphere control of this activity. The authors conclude that babbling engages the language processing centers in the left hemisphere of the brain. Other findings support the role of the left cerebral hemisphere in the simultaneous control of vocal and gestural communication in humans. Such a relation is clear in the preferential use of the right hand in situations in which participants are asked to recall lists of words or narratives (Kimura, 1973). It has also been observed that the complexity and frequency of gestures made by adults and children are highly related to the complexity and frequency of their spontaneous language. Thus, it is not surprising to observe that people who stutter interrupt their gestures until speech resumes (Mayberry, Jaques, & DeDe, 1998). The relation between gestures and speech is very strong during human ontogeny, with an increasing involvement of the right hand in gestural communication (Blake, O'Rourke, & Borzellino, 1994). This association is reinforced when vocalizations and speech occur simultaneously (Locke et al., 1995). As far as human ontogeny is concerned, two sets of data favor the predominance of manual and gestural activities over oral activity. Firstly, in human infants, intentional control of the hands and arms is present at around three months of age (e.g., grasping an object placed in the hands and bringing it to the mouth: Rochat, 1989) and precedes the full coordination of vision and prehension by three to four weeks. By contrast, the development of intentional vocal control in infants takes much longer (until the end of the first year: Iverson & Thelen, 1999), and remains imperfect for a much longer period of time (sound substitutions, reversals and omissions are frequent in young children's language). Secondly and complementarily, the control of the forelimbs (arms and hands) appears to be lateralized long before vocal asymmetry emerges. Thus, newborns have been reported to show predominant right-side biases in head-turning and Moro responses (Michel, 1981; Rönnqvist & Hopkins, 1998). Motor asymmetries favoring the right side for arm activity (Hepper, MacCartney, & Shannon, 1998) and thumb sucking (Hepper, Shahidullah, & White, 1991) have even been reported in 10- to 15-week-old fetuses. Altogether, these ontogenetic data suggest that asymmetries of the forelimbs develop before vocal asymmetry. Of course, this advance does
not imply that the same developmental sequence occurred during the evolution from nonhuman primates to humans. However, the fact remains that these features highlight the early maturation of motor functions and of their cerebral control in our species. In particular, it is very likely that speech and gesture have their developmental origins in early hand-mouth linkages, such that as oral activities gradually come to be used for meaningful speech, these linkages are maintained and strengthened. For Iverson and Thelen (1999), hand and mouth are tightly coupled in the mutual cognitive activity of language; these authors argue that the two systems are initially linked, and that these sensorimotor linkages form the basis for their later cognitive interdependence.
Evidence of structural and functional asymmetries in nonhuman primates

Recent neuropsychological and behavioral findings in great apes are of significant interest because they pertain to basic theories on the origin of language and speech in humans. There is now a growing body of evidence that challenges the long-held view that brain asymmetries and handedness are exclusively human traits (e.g., Warren, 1980; Corballis, 1991). This section will thus be devoted to summarizing the major findings in support of the view that, both at the cerebral and the behavioral levels, nonhuman primates show clear patterns of asymmetric processing of information, some of which are of obvious importance for the theory of language and for its evolution (Vauclair, Fagot, & Dépy, 1999). I will examine, in turn, demonstrations of hemispheric asymmetries in great apes and functionally lateralized processing of information in relation to audition and to the motor systems within the context of intentional communication.
Evidence for neuroanatomical asymmetries in the brain of apes

Two areas of the brain which are crucial for speech and language (namely Broca's area and Wernicke's area) have been studied in apes, in search of possible size differences between the left and the right cerebral hemispheres. Using magnetic resonance imaging, Gannon et al. (1998) found that the planum temporale of great apes (gorillas, chimpanzees and orangutans) was larger in the left than in the right cerebral hemisphere (this was true for 17 of 18 chimpanzee cadaver specimens studied). More recently, Cantalupo and Hopkins (2001) used an MRI technique to measure Brodmann area 44 (roughly corresponding to Broca's area) in a sample of 27 great apes. These researchers found that 20 of the apes had a left hemisphere asymmetry, six had a right hemisphere asymmetry and one ape (a bonobo) had no
bias. It remains to be shown that this strong similarity in the asymmetrical organization of the brain between humans and apes is related to functional asymmetries. This issue will be touched upon below, in the discussion of the production of intentional gestures and associated vocalizations in the chimpanzee. Some asymmetries in the Sylvian region have also been found in non-ape species: for example, the left Sylvian fissure has been found to be significantly longer than its right counterpart in the rhesus monkey (Falk et al., 1986).
Behavioral evidence of asymmetries in the perception and production of auditory communications

I will now review some of the main findings obtained in relation to the processing of communicatory information in nonhuman primates.

Asymmetries in the perception of auditory communication. A widely cited study by Petersen et al. (1978) used the dichotic technique to examine lateralized processing in the perception of species-specific vocalizations in macaques. Japanese macaque vocalizations were presented either to the left or the right ear of the subjects (Japanese macaques and other macaque species). The authors reported that all five Japanese macaques responded faster in the task when the stimuli were presented to the right ear, whereas only one of the remaining five monkeys showed the same right-ear advantage. None of the subjects showed a significant left-ear advantage. Since right-ear information predominantly reaches the left hemisphere, the authors concluded that the left hemisphere of the Japanese macaque was specialized to process meaningful (i.e., species-specific) vocalizations. Using the same technique, Heffner and Heffner (1984) further demonstrated that monkeys with a left hemisphere lesion of the posterior temporal lobe showed a greater decrement in post-operative performance and took longer to re-learn the discrimination task than did right hemisphere-lesioned monkeys. This set of studies suggests that vocalizations in monkeys are controlled by the left hemisphere. In a more naturalistic context, Hauser and Andersson (1994) examined orienting asymmetries to different auditory stimuli in rhesus monkeys living as a social group on the island of Cayo Santiago. While feeding at a food dispenser, individual monkeys were presented with different types of vocalizations, played over a concealed loudspeaker 4 to 10 meters behind the monkey. The experimenters recorded which direction (left or right) the monkeys turned to orient toward the sound. Hauser and Andersson (1994) reported that significantly more monkeys oriented to the right compared to the left for conspecific calls but not for a heterospecific call (that of a songbird). These authors interpreted their findings as evidence that the left hemisphere is dominant in
processing species-specific calls in rhesus monkeys. In a more recent study, Hauser, Agnetta, and Perez (1998) tested the same monkeys with an identical procedure and manipulated the interpulse interval of three different types of rhesus monkey vocalizations, such as grunts and alarm calls. The interpulse intervals were made either longer or shorter than the population mean pulse interval for each of the call types. The main results indicate that manipulations of the interpulse intervals outside the range of natural variation either eliminated the orienting bias or caused a shift from right- to left-ear bias. Altogether, the above results show that (a) temporal properties such as interpulse interval provide significant information to listeners about whether or not the signal is from a conspecific, and (b) the orienting bias is controlled by left hemispheric asymmetries. In a final experiment, Ghazanfar, Smith-Rohrberg, and Hauser (2001) studied orienting responses of rhesus monkeys to time-reversed vocalizations. The monkeys in the study oriented to the left, behaving as if these stimuli were novel to them. These results suggest that rhesus macaques use temporal cues to recognize conspecific vocal signals and that, at least for the kind of response used in this set of studies, it is the left hemisphere that is predominantly involved. Interestingly, the relation between the temporal features of rhesus monkey vocalizations and cerebral organization appears to be similar to what is observed in humans (Belin et al., 1998).

Asymmetry in the production of auditory communication. Only one study is available concerning lateralization in the production of vocalizations in nonhuman primates. Hauser and Akre (2001) videotaped the timing asymmetry of both facial and vocal expressions in Cayo Santiago rhesus monkeys. They observed that for both adults and infants, the left side of the face initiated the expression before the right, thereby implicating a right hemisphere specialization. As some of the recorded expressions were related to positive/approach emotions while others were associated with negative/withdrawal emotions, emotional valence did not appear to influence the direction of this motor asymmetry. Such results are somewhat difficult to interpret, as they stand in sharp contrast with the data reported for the perception of vocalizations in macaques, a species for which a left hemispheric advantage has been systematically reported. They are also difficult to explain with respect to the laterality of the mechanisms controlling both speech perception and production in humans, which are mostly underlain by structures located in the left cerebral hemisphere (see Hauser & Akre, 2001, and Weiss et al., 2002, for hypotheses concerning potential differences between these mechanisms in human and nonhuman primates; see also the section below on the cortical control of nonhuman primate vocalizations).
Animal communication and intentions

A crucial issue for establishing a valid nonhuman primate model of human communication, including speech, concerns the status of the signals (vocalizations, gestures) used by primates in their spontaneous communication, as well as of those used in trained situations in which apes are taught forms of human language. To make a long story short, this question amounts to asking whether these signals are referential, and thus could be more or less equivalent to linguistic signs, or whether they exclusively convey emotionally based information. This matter is controversial among primatologists and comparative psychologists. Some consider that these signals (vocalizations) convey information with semantic content concerning, for example, the presence of predators (Seyfarth, Cheney, & Marler, 1980), food (Dittus, 1984) or social relationships (Gouzoules, Gouzoules, & Marler, 1984), while others call for more cautious interpretations of these communications and suggest that they are likely to combine both emotionally and referentially based information (e.g., Hauser, 2000; Vauclair, 2003). Interestingly, the difficulties in interpreting nonhuman primate communicative signals culminate in discussions about auditory signals, because of the implicit or explicit relation that exists between these signals and linguistic signs (Vauclair, 1996). The question of the symbolic or semantic status of gestural signals seems to be less decisive because, as Leavens explains in his article (this issue), gestures rarely if ever stand for the event or object to which attention is being drawn. Thus, it is easier to propose an operational definition of gestures as referential signals in the sense of behaviors serving to direct attention. Moreover, these gestures can also be viewed as intentional because (a) they are produced in a social context, (b) they imply visual contact between the partners engaged, and (c) they imply some changes in the behavior of both the signaler and the partner. For Leavens (this issue), these criteria are met for the gestures used by apes, especially for pointing.
Laterality and manual gestures in intentional communication

An interesting and novel field of inquiry has recently emerged in the comparative literature concerning the functional use of gestures in great ape communication. Wild chimpanzees are known to use communicative gestures in various contexts such as begging for food, courtship, intimidation, greetings, etc. (Goodall, 1986; Plooij, 1978). In contrast with vocalizations, the use of these gestures requires close visual contact between partners. In addition, the gestures are usually performed between only two individuals. In this respect, communicative gestures are more
appropriate than vocal signals in the search for the evolutionary precursors of speech, because the latter are typically not directed to specific individuals. A number of independent observations carried out on captive apes have shown that these communicative gestures are preferentially performed with the right hand (in gorillas: Shafer, 1993; in bonobos: Shafer, 1997; in chimpanzees: Hopkins & Leavens, 1998). The referential and intentional status of these gestures has also been convincingly established (Leavens, this issue). Captive apes are regularly observed using manual gestures when food is placed out of their reach. If an audience is present, the apes increase the frequency of their gestures and alternate their gaze between the food object and the social agent. These behaviors suggest that the apes monitor the effect of their gestures on the social partner (a human) to whom they direct their communicative acts. In a unique study, Hopkins and Cantero (2003) examined the spontaneous production of gestures and vocalizations in a captive group of 73 chimpanzees. The study was prompted by observations that right hand use in gestural communication was significantly higher when the gestures were accompanied by a vocalization. The procedure was simple: an experimenter stood approximately one meter from the chimpanzees' home cage, directly in front of the chimpanzee subject, then approached the cage and offered the chimpanzee a banana. Since the banana was out of the immediate reach of the ape, this condition stimulated the production of communicative behaviors by the chimpanzee subject. Note that the experimenter maintained eye contact with the subject throughout the duration of the trial in order to increase the probability that the ape would produce a communicative behavior. Begging gestures directed toward the experimenter, whether or not accompanied by vocalizations, were recorded for one minute. The data showed that each chimpanzee produced on average 29 gestures (over ten trials), about seven of which were accompanied by a vocalization. Concerning laterality, right-hand population biases were found for gestures alone and for gestures associated with vocalizations. Within the entire sample of chimpanzees, 51 subjects produced gestures both with and without vocalizations. An analysis conducted on this subsample revealed that gesture + vocal right-handedness scores were significantly higher than gesture + no-vocal handedness scores (Figure 1). It is important to establish whether the use of the right hand within a communicative context generalizes to other motor tasks or is specific to gestural communication. Since the chimpanzees tested in the study are being reared in a human-designed, right-handed world, it needs to be shown that the preferential use of the right hand for gestural communication is not correlated with other measures of hand use and therefore does not reflect a bias associated with other motor functions. Hopkins and Cantero (2003) verified that this was not the case, finding that the use of the right hand in communicative contexts was independent of other measures of handedness, such as hand use in simple reaching and in bimanual feeding.
[Figure 1 appears here: a bar chart of mean HI values (y-axis: Mean HI values, 0–60; x-axis: Groups) for three response categories: Total, Gestures + Vocal, and Gestures + No-Vocal.]
Figure 1. Mean handedness indices (HI) for the overall number of gestures, the gestures produced with a vocalization (Gesture + Vocal) and the gestures produced without a vocalization (Gesture + No-Vocal).

Mean handedness indices (HI) were derived by subtracting the number of left-hand responses from the number of right-hand responses and dividing by the total number of responses: HI = (R − L)/(R + L). Indices < 0 indicate a left bias; indices > 0 indicate a right bias. The figure shows HI values for the overall number of gestures (Total), the number of gestures produced with a vocalization (Gesture + Vocal) and the number of gestures produced without a vocalization (Gesture + No-Vocal). All were significantly different from zero. In addition, the HI values for the Gesture + Vocal responses were significantly higher than the HI values for the Gesture + No-Vocal responses. (Adapted from Hopkins & Cantero, 2003)
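To illustrate how the index behaves (with hypothetical counts, not data from the study): an animal producing 30 right-hand and 10 left-hand gestures would obtain HI = (30 − 10)/(30 + 10) = +0.50, a moderate right-hand bias; the reverse counts would yield HI = −0.50; and perfectly balanced hand use would yield HI = 0. The index is thus bounded between −1 (exclusive left-hand use) and +1 (exclusive right-hand use).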
The findings from this study thus indicate that the preferential use of the right hand for gestures is significantly enhanced when the gestures are accompanied by a vocalization. Taken together, these results suggest that the neurobiological substrates for nonvocal intentional, referential gestural communication are lateralized to the left hemisphere. Moreover, these results further imply that the production of the vocalizations used by chimpanzees may be lateralized to the left hemisphere, because they have a facilitative effect on right but not on left hand use in gestural communication. This set of data thus shows a remarkable convergence with the behavior of humans (children: Blake et al., 1994, and adults: Kimura, 1973) when they simultaneously produce speech and manual gesticulations. A fascinating extension of these findings was reported by Hopkins and Cantalupo (2003). Based on their report that Brodmann's area 44 (BA44) was larger
in the left compared with the right hemisphere in the great apes (Cantalupo & Hopkins, 2001), these authors looked for a possible association between the anatomical asymmetries observed in Broca's area and asymmetries in gestural communication, as well as in hand use for simple reaching. Using a subsample of the 20 chimpanzees previously examined with MRI techniques (see above), Hopkins and Cantalupo (2003) found negative correlations between the handedness index values for gestures and BA44. This result indicates that increased right hand use is associated with a larger left-hemisphere BA44. When the correlation coefficients were adjusted for simple reaching, the index values for communicative gestures were significantly associated with the medial portion of BA44, and the association with the total BA44 approached statistical significance. These findings need to be investigated further with a larger sample of apes to establish the association more completely. Nevertheless, these data reveal for the first time that structural asymmetries in the brain of great apes have functional counterparts in the asymmetry of hand use, notably with respect to the production of intentional vocal and gestural communications.
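The sign of this correlation is easy to interpret if one assumes, as is common in this literature though not spelled out here, that the anatomical asymmetry is expressed as a signed quotient of the same form as the behavioral index, e.g., AQ = (R − L)/(R + L) computed over right- and left-hemisphere BA44 volumes. Under that convention a left-larger BA44 yields a negative AQ, so a hypothetical ape combining a right-hand gestural bias (HI = +0.6) with a leftward BA44 asymmetry (AQ = −0.2) contributes to a negative HI–AQ correlation: right-hand use and left-hemisphere enlargement go together.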
Other kinds of evidence in nonhuman primates

Mirror neurons in the monkey brain

The discovery of neurons in the monkey's premotor cortex that discharge both when the monkey makes a particular action and when it observes another individual, monkey or human, making a similar action (Gallese et al., 1996) offers converging evidence of the importance of manual actions and gestures in understanding actions made by others. The existence of such mirror neurons, which map perception onto execution, could provide one of the keys for understanding the origin of language. Note that these mirror neurons are located in area F5, a homologue of Broca's area in the monkey brain. Such mirror neurons have also been described in Broca's area in humans (e.g., Nishitani & Hari, 2000), suggesting that the representation of actions and speech is processed by the same cerebral structures. In a recent study, Kohler et al. (2002) reported that area F5 of the macaque brain contains not only visual mirror neurons but also auditory mirror neurons. These neurons discharge when the animal performs a specific action, as well as when it hears the sound produced by such an action (e.g., ripping a piece of paper or dropping a stick). It thus seems that area F5 of the monkey brain is predisposed to managing not only visuo-gestural but also auditory-visual systems of communication.
From the above findings, the perspective proposed here is that the development of the human lateral speech circuit resulted from the fact that the precursor of Broca's area was endowed, before the appearance of speech, with a mechanism for recognizing actions made by others. This mechanism was the neural prerequisite for the development of inter-individual communication and, finally, of speech. In this respect, language needs to be viewed in a more general setting than one that considers speech as its complete basis, as it is involved in both action recognition (including gestures) and speech processing (Rizzolatti & Arbib, 1998).
Some functional differences between animal communication and human language

A consideration of the specific modalities underlying speech, and a comparison between these modalities and nonhuman communication, may also help shed light on the question of the gestural origin of language. Developmental psychologists distinguish two main modalities or functions in linguistic as well as prelinguistic communication among humans (Bates, 1979). The primary function of language is to exchange information about the world. Such an informative function takes two forms: a declarative form used in representing states of the world (e.g., "John is coming") and an interrogative form. The other function is injunctive (imperative) and exclamatory, and is mostly expressed through requests and demands (e.g., "Come here!"). Developmental studies with young children have shown that the use of declaratives (see references in Vauclair, 2003) becomes the dominant mode of communication between one and two years of age (about 60% of all utterances). A major difference between humans and nonhuman primates lies in the fact that nonhumans' use of signals and learned symbols is largely restricted to the imperative function, whereas humans use them predominantly for declarative purposes. These declaratives can be words or gestures, and they function not primarily to obtain a result in the physical world, but to direct another individual's attention (his or her mental state) to an object or event, as an end in itself. Thus, a human toddler might say "Bird!" apparently to mean "It's a bird!" or "Look! A bird," and so on. In such cases, the child communicates simply to share interest in something that he or she sees: that the object is a bird, that the child has identified it, and that he or she wants the partner to look at it. It can be asserted with some confidence that the use of protoimperative signals is the exclusive mode of communication by animals of different phyla. When, for example, your cat meows at you in the vicinity of the window and at the same time glances back and forth from the window to you, the cat is using a protoimperative
signal that can be interpreted as "I want to go out." But it is very unlikely that your cat would use these same communicative signals to let you know that it has noticed something interesting in the garden and that it wants to share its discovery with you. I have claimed (Vauclair, 1982, 1984, 1996, 2003; see also Tomasello & Camaioni, 1997) that this imperative function also appears to be the predominant (if not exclusive) mode used by "linguistically" trained apes. For example, an analysis of the combinations of visual productions made by the famous bonobo Kanzi (Savage-Rumbaugh, Rumbaugh, & McDonald, 1985) reveals that 96% of this ape's productions were requests. Interestingly, these productions mostly consisted of combinations of visual signals (lexigrams punched on a keyboard) and gestures directed to the human partner. Thus, the difference between Kanzi's modality of communication and the typical declarative mode observed in humans is striking. In effect, communication in apes has an essentially imperative function. This appears to be the rule for all animal species, and this mode fulfills biological requirements, for example warning against predators, as in vervet monkey alarm calls (Seyfarth et al., 1980). By contrast, humans use not only speech but also prelinguistic means of communication such as gestures (e.g., pointing) for both imperative and declarative purposes (for example, two persons sharing an interest in a third person, an object, or an event). Place (2000) has argued that, in humans, the system of "mands," in Skinner's (1957) sense, has ontogenetic primacy over the system of "tacts." Mands can be broadly defined as commands, requests or questions that the speaker addresses to a listener. A mand serves to specify an action to be performed by the listener, the realization of which operates primarily for the benefit of the speaker. By contrast, tacts constitute more complex forms of behavior in the sense that "they are reinforced, not, as in the case of the mand, by the behavior they call for from the listener, but by a variety of specialized reinforcers, responses such as gratitude for information supplied, agreement with opinions given, sympathy for troubles told, surprise at and interest in news reported, or laughter at jokes" (Place, 2000, II.iii). It follows from this distinction that "in the evolution of language it [the tacts] must have developed later [than the mands], as it does in the child. Moreover, since interrogative mands presuppose the availability of the tacts they solicit from the listener, it follows that the first sentences must all have been imperatives" (Place, 2000, II.iii). The parallel between mands and tacts, on the one hand, and imperatives and declaratives and their respective functions, on the other, is striking. It is thus tempting to speculate that mands and protoimperative actions are the dominant actions both in nonhuman primates and in the developing human infant and child. It is also likely that these
systems function best by means of mimed movements and by pointing gestures. This view is reminiscent of the scenario offered long ago by Condillac (1746) in his theory of a "language of action." He stressed that man's first efforts at communication required signals (gestures and pantomimes, then vocal signs) produced in a context in which they were unambiguous and self-explanatory.
Cortical control of nonhuman primates' vocalizations

In the section devoted to the presentation of asymmetries in the production of auditory communications in nonhuman primates, I reported that these productions were lateralized to the right cerebral hemisphere in the macaque (Hauser & Akre, 2001). This finding is somewhat troubling in light of the human data concerning speech control. Steklis and Harnad (1976) wrote some years ago that "the neural control of the vocal activity of nonhuman primates is somehow not adapted to the kind of activity involved in language. These vocalizations are controlled by evolutionarily primitive regions of the brain which are involved in stereotyped species-typical communicative behaviors and emotion" (p. 447). They added that "primate calls are a relatively restricted and predictable set for a particular species, and even if they depend upon experience for acquisition, the amount of variation in the final product is negligible compared to the variety of learned complex behaviors of which the limbs, the most qualified candidates of all, are capable" (p. 445). What explanations could be offered for the findings on comprehension/production systems and their lateralization in nonhuman primates? First, we must dissociate comprehension from production in terms of the evolutionary pressures that have acted on these processes and on their cerebral organization. As observed by Hauser (1996), the cortical component in primate vocalization may be more pronounced with respect to perception than with respect to production. In the former case, several demonstrations mentioned earlier in this paper suggest that the cortical system for the perception of species-specific calls in nonhuman primates is lateralized to the left. For production systems, which appear to be lateralized to the right side of the brain in monkeys, a simple explanation is to consider that the production of these calls occurs in emotional situations, such as danger to the group. In this respect, it is not surprising, given the content and nature of the information conveyed, to observe control by the right hemisphere. In addition, the lack of intentional control over these calls may be adaptive because it makes them impossible to fake (Knight, 1998). The picture is very different for chimpanzees and possibly for other great apes. In chimpanzees, there is good evidence for the existence of both structural asymmetries in Broca's area and Wernicke's area (see above) and functional lateralization
in the association of gestures and vocalizations during intentional communicative actions (see Hopkins & Cantero, 2003, and above). It is noteworthy that, to my knowledge, only one study is available concerning cerebral control of vocalizations in apes. Berntson, Boysen, and Torello (1993) recorded ERP measures in chimpanzees during the presentation of simple non-signal stimuli as well as conspecific and human vocalizations, and found a right hemisphere laterality in the processing of the significant vocal stimuli. This study concerned only a single chimpanzee and in no way permits us to draw a definite conclusion. Data on volitional control of vocalizations, both for comprehension and production, are thus badly needed. Such data will help us to better understand the neural systems involved in higher cognitive and communicative abilities in chimpanzees and other ape species.
Theoretical implications

I wish to point out the implications of the results reviewed here for the use of a primate model in support of the gestural theory of the origin of speech. Firstly, there is now some evidence that chimpanzees not only possess brain asymmetries in speech-related areas but also use gestures in an intentional and referential way. Such findings offer clear support for theories proposing gestural origins of human language and speech (e.g., Kendon, 1995; Corballis, 2003). A safe hypothesis is to consider that this asymmetry was also present in the common ancestor of humans and chimpanzees, at least 5 million years ago when the ape-human lineage split. Secondly, given the available evidence, it might be wise to distinguish handedness from lateralized hand use within a communicative context. The argument can be described with respect to two issues. The first point concerns the systematic report of left hemispheric control of vocalizations in an impressive range of animal species (from frogs to mice, and from birds to dolphins and monkeys; for reviews, see Rogers & Andrew, 2002). This robust coherence of left hemispheric control of vocal communication in the animal kingdom most likely reflects the necessity to fulfill basic needs in relation to the acoustically relevant features of the calls. In this respect, this left hemisphere control in vocal animals might be similar to its involvement, in humans, in the temporal and spectral analysis of speech (Fitch, Miller, & Tallal, 1997; Schwartz & Tallal, 1980). Thus, vocal communication in animals also relies heavily on the use of small and rapid changes in the sound produced. For example, Charrier, Mathevon, and Jouventin (2001) have reported that frequency modulation appears to be a key component of individual recognition in the sub-Antarctic fur seal. Similarly, Hauser et al. (1998, and see above) manipulated the interpulse interval in rhesus monkey calls and showed that this change
provoked either an elimination of the left hemispheric bias or a shift from a left to a right bias. Aside from monkeys and apes, the species mentioned above do not possess limbs equivalent to hands, but nevertheless show left hemispheric control of the reception, and sometimes of the production, of their vocal communications. A second argument holds that the relation between handedness and language is not total. This view comes from the obvious fact that about 70% of left-handers are also left-cerebrally dominant for language. From a brain imaging study of word generation in a large sample of right-handed participants, Knecht et al. (2000) concluded that the association between handedness and language dominance "is not an absolute one" (p. 78). These facts, and the finding that about 65% of individuals belonging to large groups of chimpanzees exhibit right-hand preferences during bimanual coordination tasks (e.g., Hopkins, 1994), led Hopkins and Cantalupo (2003) to suggest that "from an evolutionary perspective, right-handedness may have evolved after the emergence of asymmetries associated with gestural communication, as Corballis [2003] has proposed, but handedness may not have been a direct consequence of selection for motor systems associated with language and speech in modern humans" (p. 225). In the article "From mouth to hand: Gesture, speech, and the evolution of right-handedness," Corballis (2003) responds to the commentaries on his article. He defends the view that language has its origins in the gestural system, writing, "I also think it likely, despite the doubts of some commentators, that there is indeed a link between handedness and the left-cerebral control of speech, and the balance of evidence still seems to me to support the idea that it was an asymmetry in the control of the organs of speech that provided the nudge. Whether this asymmetry originated in the lateralized control of vocalization itself and whether it has ancient roots, now seem more problematic [my italics]. I think we need more evidence about the control of vocalization, from both evolutionary and neurological perspectives" (p. 250). I believe that the kind of evidence Corballis (2003) asks for is exactly what the recent ape studies on neuroanatomical asymmetries and on laterality in gestures suggest, namely that the neurobiological basis for intentional, referential communication was present prior to hominid evolution. Of course, a number of important issues need to be resolved to establish solid ground for the nonhuman primate model of speech and language origins. Although apes appear to represent particularly appropriate phylogenetic models for addressing these issues, there are still two serious problems that limit their use in the debate over the question of the origin of language. The first main problem, as I have stressed above, concerns the urgent need to obtain detailed information on the neural systems involved in the processing of communicative and cognitive abilities in these species. The introduction of novel
brain imaging techniques for investigating animals, including nonhuman primates, while they are awake (Logothetis, 2003) is very promising in this respect. The second main problem, which cannot be solved by technical progress alone, relates to determining the nature of the vocal signals used by nonhuman primates. As I have noted earlier, the question of whether these vocalizations refer only to emotional states or convey semantic information is still controversial. In addition to the demonstrations offered earlier in this article, recent studies on Diana monkeys reinforce the view that the alarm calls of these monkeys are modulated in such a way that they provide information related not only to the class of predator signaled (the leopard or the crowned eagle) but also to the distance of the predator from the caller (Zuberbuhler, 2000). A detailed analysis of the calls of free-ranging Diana monkeys has also revealed that the modulation of the formants of the monkey calls results from active vocal filtering (Riede & Zuberbuhler, 2003). Riede and Zuberbuhler argue that this filtering is used by the monkeys in order to encode semantic information. Of course, the underlying neural systems controlling this kind of signal in this species must still be explained.
Acknowledgments

I thank William D. Hopkins for providing material and relevant references and for discussions on the question of the origins of language and speech. The writing of this article was supported by the European Science Foundation EUROCORES-OMLL program "The Origin of Man, Language and Languages" and by the French program OHLL (CNRS) "L'Origine de l'Homme, du Langage et des Langues."
References

Armstrong, D., Stokoe, W., & Wilcox, S. (1995). Gesture and the nature of language. Cambridge: Cambridge University Press.
Bates, E. (1979). The emergence of symbols: Cognition and communication in infancy. New York: Academic Press.
Belin, P., Zilbovicius, M., Crozier, S., Thivard, L., Fontaine, A., Masure, M.-C., & Samson, Y. (1998). Lateralization of speech and auditory temporal processing. Journal of Cognitive Neuroscience, 10, 536–540.
Berntson, G. G., Boysen, S. T., & Torello, M. W. (1993). Vocal perception: Brain event-related potentials in a chimpanzee. Developmental Psychobiology, 26, 305–319.
Blake, J., O'Rourke, P., & Borzellino, G. (1994). Form and function in the development of pointing and reaching gestures. Infant Behavior and Development, 17, 195–203.
Bradshaw, J. L. (1988). The evolution of human lateral asymmetries: New evidence and second thoughts. Journal of Human Evolution, 17, 615–637.
Cantalupo, C., & Hopkins, W. D. (2001). Asymmetric Broca's area in great apes. Nature, 414, 505.
Charrier, I., Mathevon, N., & Jouventin, P. (2001). Mother's voice recognition by seal pups. Nature, 412, 873.
Condillac, E. B. de (1746/1947). Essai sur l'origine des connaissances humaines. In G. Le Roy (Ed.), Oeuvres philosophiques de Condillac. Paris: Presses Universitaires de France.
Corballis, M. C. (1989). Laterality and human evolution. Psychological Review, 96, 492–505.
Corballis, M. C. (1991). The lopsided ape: Evolution of the generative mind. Oxford, UK: Oxford University Press.
Corballis, M. C. (2003). From mouth to hand: Gesture, speech, and the evolution of right-handedness. The Behavioral and Brain Sciences, 26, 199–260.
Corina, D. P., Vaid, J., & Bellugi, U. (1992). The linguistic basis for left hemisphere specialization. Science, 255, 1258–1260.
Dittus, W. (1984). Toque macaque food calls: Semantic communication concerning food distribution in the environment. Animal Behaviour, 32, 470–477.
Falk, D., Cheverud, J., Vannier, M. W., & Conroy, G. D. (1986). Advanced computer graphics technology reveals cortical asymmetry in endocasts of rhesus monkeys. Folia Primatologica, 46, 98–103.
Fitch, R. H., Miller, S., & Tallal, P. (1997). Neurobiology of speech perception. Annual Review of Neuroscience, 20, 331–353.
Gallese, V., Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119, 593–609.
Gannon, P. J., Holloway, R. L., Broadfield, D. C., & Braun, A. R. (1998). Asymmetry of chimpanzee planum temporale: Humanlike pattern of Wernicke's brain language area homologue. Science, 279, 220–222.
Ghazanfar, A. A., Smith-Rohrberg, D., & Hauser, M. D. (2001). The role of temporal cues in rhesus monkey vocal recognition: Orienting asymmetries to reversed calls. Brain Behavior and Evolution, 58, 163–172.
Goodall, J. (1986). The chimpanzees of Gombe. Cambridge, MA: Harvard University Press.
Gouzoules, S., Gouzoules, H., & Marler, P. (1984). Rhesus monkey (Macaca mulatta) screams: Representational signalling in the recruitment of agonistic aid. Animal Behaviour, 32, 182–193.
Grossi, G., Semenza, C., Corazza, S., & Volterra, V. (1996). Hemispheric specialization for sign language. Neuropsychologia, 34, 737–740.
Hauser, M. D. (1996). The evolution of communication. Cambridge, MA: Bradford Books/MIT Press.
Hauser, M. D. (2000). Wild minds: What animals really think. New York: Henry Holt & Company.
Hauser, M. D., Agnetta, B., & Perez, C. (1998). Orienting asymmetries in rhesus monkeys: The effect of time-domain changes on acoustic perception. Animal Behaviour, 56, 41–47.
Hauser, M. D., & Akre, K. (2001). Asymmetries in the timing of facial and vocal expressions in rhesus monkeys: Implications for hemispheric specialization. Animal Behaviour, 61, 391–408.
Hauser, M. D., & Andersson, K. (1994). Left hemisphere dominance for processing vocalizations in adult, but not infant, rhesus monkeys: Field experiments. Proceedings of the National Academy of Sciences, USA, 91, 3946–3948.
Heffner, H. E., & Heffner, R. S. (1984). Temporal lobe lesions and perception of species-specific vocalizations by macaques. Science, 226, 75–76.
Hepper, P. G., Shahidullah, S., & White, R. (1991). Handedness in the human fetus. Neuropsychologia, 29, 1107–1111.
Hepper, P. G., MacCartney, G. R., & Shannon, I. A. (1998). Lateralised behaviour in first trimester human fetuses. Neuropsychologia, 36, 531–534.
Hewes, G. W. (1973). Primate communication and the gestural origin of language. Current Anthropology, 14, 5–24.
Holowka, S., & Petitto, L. A. (2002). Left hemisphere cerebral specialization for babies while babbling. Science, 297, 1515.
Hopkins, W. D. (1994). Hand preference for bimanual feeding in a sample of 140 chimpanzees. Developmental Psychobiology, 31, 619–625.
Hopkins, W. D., & Cantalupo, C. (2003). Brodmann's area 44, gestural communication and the emergence of right handedness in chimpanzees. Commentary on M. Corballis, "From mouth to hand: The evolution of right handedness". The Behavioral and Brain Sciences, 26, 224–225.
Hopkins, W. D., & Cantero, M. (2003). From hand to mouth in the evolution of language: The influence of vocal behavior on lateralized hand use in manual gestures by chimpanzees (Pan troglodytes). Developmental Science, 6, 55–61.
Hopkins, W. D., & Leavens, D. A. (1998). Hand use and gestural communication in chimpanzees (Pan troglodytes). Journal of Comparative Psychology, 112, 95–99.
Iverson, J. M., & Thelen, E. (1999). Hand, mouth, and brain: The dynamic emergence of speech and gesture. Journal of Consciousness Studies, 6, 19–40.
Kendon, A. (1991). Some considerations for a theory of language origins. Man, 26, 199–221.
Kendon, A. (1993). Human gesture. In K. R. Gibson & T. Ingold (Eds.), Tools, language and cognition in human evolution (pp. 43–62). Cambridge, UK: Cambridge University Press.
Kimura, D. (1973). Manual activity during speaking: I. Right-handers. Neuropsychologia, 11, 45–50.
Kimura, D. (1993). Neuromotor mechanisms in human communication. Oxford: Oxford University Press.
Knecht, S., Deppe, M., Dräger, B., Bobe, L., Lohmann, H., Ringelstein, E. B., & Henningsen, H. (2000). Language lateralization in healthy subjects. Brain, 123, 74–91.
Knight, C. (1998). Ritual/speech coevolution: A solution to the problem of deception. In J. R. Hurford, M. Studdert-Kennedy, & C. Knight (Eds.), Approaches to the evolution of language (pp. 68–91). New York: Cambridge University Press.
Kohler, E., Keysers, C., Umiltà, M. A., Fogassi, L., Gallese, V., & Rizzolatti, G. (2002). Hearing sounds, understanding actions: Action representation in mirror neurons. Science, 297, 846–848.
Locke, J. L., Bekken, K. E., McMinn-Larson, L., & Wein, D. (1995). Emergent control of manual and vocal-motor activity in relation to the development of speech. Brain and Language, 51, 498–508.
Logothetis, N. K. (2003). MR imaging in the non-human primate: Studies of function and of dynamic connectivity. Current Opinion in Neurobiology, 13, 1–13.
Mayberry, R., Jaques, J., & DeDe, G. (1998). What stuttering reveals about the development of the gesture-speech relationship. New Directions for Child Development, 79, 77–87.
Michel, G. F. (1981). Right handedness: A consequence of infant supine head orientation preference? Science, 212, 685–687.
Nishitani, N., & Hari, R. (2000). Temporal dynamics of cortical representation for action. Proceedings of the National Academy of Sciences, USA, 97, 913–918.
Petersen, M. R., Beecher, M. D., Zoloth, S. R., Moody, D. B., & Stebbins, W. C. (1978). Neural lateralization of species-specific vocalizations in Japanese macaques (Macaca fuscata). Science, 202, 324–327.
Petitto, L. A., Holowka, S., Sergio, L. E., & Ostry, D. (2001). Language rhythms in baby hand movements. Nature, 413, 35–36.
Place, U. T. (2000). The role of the hand in the evolution of language. Psycoloquy, 11(007).
Plooij, F. X. (1978). Some basic traits of language in wild chimpanzees. In A. Lock (Ed.), Action, gesture, and symbol: The emergence of language (pp. 111–131). London: Academic Press.
Riede, T., & Zuberbuhler, K. (2003). The relationship between acoustic structure and semantic information in Diana monkey alarm vocalization. Journal of the Acoustical Society of America, 114, 1132–1142.
Rizzolatti, G., & Arbib, M. A. (1998). Language within our grasp. Trends in Neurosciences, 21, 188–194.
Rochat, P. (1989). Object manipulations and exploration in 2- to 5-month-old infants. Developmental Psychology, 25, 871–884.
Rogers, L., & Andrews, R. (Eds.). (2002). Cerebral vertebrate lateralization. New York: Cambridge University Press.
Rönnqvist, L., & Hopkins, B. (1998). Head position preference in the human newborn: A new look. Child Development, 69, 13–23.
Savage-Rumbaugh, E. S., Rumbaugh, D. M., & McDonald, K. (1985). Language learning in two species of apes. Neuroscience and Biobehavioral Reviews, 9, 653–665.
Schwartz, J., & Tallal, P. (1980). Rate of acoustic change may underlie hemispheric specialization for speech perception. Science, 207, 1380–1381.
Seyfarth, R. M., Cheney, D. L., & Marler, P. (1980). Vervet monkey alarm calls: Semantic communication in a free-ranging primate. Animal Behaviour, 28, 1070–1094.
Shafer, D. D. (1993). Patterns of hand preference in gorillas and children. In J. P. Ward & W. D. Hopkins (Eds.), Primate laterality: Current behavioral evidence of primate asymmetries (pp. 267–283). New York: Springer-Verlag.
Shafer, D. D. (1997). Hand preference behaviors shared by two groups of captive bonobos. Primates, 38, 303–313.
Skinner, B. F. (1957). Verbal behavior. New York: Appleton-Century-Crofts.
Steklis, H. D., & Harnad, S. (1976). From hand to mouth: Some critical stages in the evolution of language. Annals of the New York Academy of Sciences, 280, 445–455.
Tomasello, M., & Camaioni, L. (1997). A comparison of the gestural communication of apes and human infants. Human Development, 40, 7–24.
Vauclair, J. (1982). Sensorimotor intelligence in human and nonhuman primates. Journal of Human Evolution, 11, 757–764.
Vauclair, J. (1984). A phylogenetic approach to object manipulation in human and nonhuman primates. Human Development, 27, 321–328.
Vauclair, J. (1996). Animal cognition: Recent developments in modern comparative psychology. Cambridge, MA: Harvard University Press.
Vauclair, J. (2003). Would humans without language be apes? In J. Valsiner (Series Ed.) & A. Toomela (Vol. Ed.), Cultural guidance in the development of the human mind: Vol. 7. Advances in child development within culturally structured environments (pp. 9–26). Greenwich, CT: Ablex Publishing Corporation.
Vauclair, J., Fagot, J., & Dépy, D. (1999). Nonhuman primates as models of hemispheric specialization. In M. Haug & R. E. Whalen (Eds.), Animal models of human emotion and cognition (pp. 247–256). Washington, DC: APA Books.
Wallman, J. (1992). Aping language. Cambridge: Cambridge University Press.
Warren, J. M. (1980). Handedness and laterality in humans and other animals. Physiological Psychology, 8, 351–359.
Weiss, D., Ghazanfar, A., Miller, C. T., & Hauser, M. D. (2002). Specialized processing of primate facial and vocal expressions: Evidence for cerebral asymmetries. In L. Rogers & R. Andrews (Eds.), Cerebral vertebrate lateralization (pp. 480–530). New York: Cambridge University Press.
Zuberbuhler, K. (2000). Referential labelling in Diana monkeys. Animal Behaviour, 59, 917–927.
About the Author

Jacques Vauclair is Professor of Developmental Psychology at the University of Provence, and Director of the Research Center in Psychology of Cognition, Language and Emotion, Aix-en-Provence (France). He previously worked at the Research Center for Cognitive Neuroscience at the CNRS in Marseilles on laterality and comparative cognition in human and nonhuman primates. His most recent research interests concern issues related to laterality and the development and origin of language. He is chief editor, jointly with Jean-Paul Caverni, of the e-journal Current Psychology Letters: Brain, Behavior and Cognition.
Manual deixis in apes and humans

David A. Leavens
University of Sussex
Pointing by apes is near-ubiquitous in captivity, yet rare in their natural habitats. This has implications for understanding both the ontogeny and heritability of pointing, conceived as a behavioral phenotype. The data suggest that the cognitive capacity for manual deixis was possessed by the last common ancestor of humans and the great apes. In this review, nonverbal reference is distinguished from symbolic reference. An operational definition of intentional communication is delineated, citing published or forthcoming examples for each of the defining criteria from studies of manual gestures in apes. Claims that chimpanzees do not point amongst themselves or do not gesture declaratively are refuted with published examples. Links between pointing and cognitive milestones in other domains relating means to ends are discussed. Finally, an evolutionary scenario of pointing as an adaptation to changes in hominid development is briefly sketched. Keywords: pointing, hominoidea, apes, communication, manual gestures, deixis, nonverbal reference
One of the most striking human developmental transitions is the dawn of manual deixis, or pointing, around the end of the first year of life (Bates, Camaioni, & Volterra, 1975; Butterworth, 2001; Carpenter, Nagell, & Tomasello, 1998; Franco & Butterworth, 1996). Deixis is the ability to locate, for an observer, a specific entity or location. Despite sporadic published reports of pointing by apes in captivity (reviewed by Leavens & Hopkins, 1999), until recently most developmental psychologists believed that pointing was a uniquely human behavior (e.g., Butterworth & Grover, 1988; Povinelli & Davis, 1994). Pointing is how one organism manipulates the visual attention of another to some distant entity; it is therefore a manifestly referential act, insofar as it co-ordinates the visual attention of two separate organisms (e.g., Bates, O’Connell, & Shore, 1987). At present, the evidence is as follows: (a) human infants in Western societies commonly point to distant objects or events by the beginning of the second year of life, (b) apes in the wild only rarely
point, (c) apes in captivity point very frequently, usually in the complete absence of explicit training, (d) monkeys in the wild have not been reported to point, and (e) monkeys in captivity only rarely point spontaneously, but they can be readily trained to point (Table 1).

Table 1. Phylogenetic patterns in manual deixis

                                                            Characteristics of pointing
                                                            Imperative?   Declarative?
Humans
  Western civilization, Japan (Homo sapiens) (a)            Yes           Yes
  Autism (b)                                                Yes           Rare
Great apes (captive, language-trained or home-reared)
  Chimpanzees (Pan troglodytes) (c)                         Yes           Yes
  Bonobos (Pan paniscus) (d)                                Yes           Yes
  Gorillas (Gorilla gorilla) (e)                            Yes           Yes
  Orangutans (Pongo pygmaeus) (f)                           Yes           Yes
Great apes (captive, neither language-trained nor home-reared)
  Chimpanzees (Pan troglodytes) (g)                         Yes           No
  Bonobos (Pan paniscus) (h)                                Yes           No
  Gorillas (Gorilla gorilla) (i)                            No            No
  Orangutans (Pongo pygmaeus) (j)                           Yes           N/A
Great apes (feral, in natural habitats)
  Chimpanzees (Pan troglodytes) (k)                         N/A           Yes
  Bonobos (Pan paniscus) (l)                                N/A           Yes
  Gorillas (Gorilla gorilla)                                N/A           N/A
  Orangutans (Pongo pygmaeus)                               N/A           N/A
Monkeys (captive)
  Rhesus macaque (Macaca mulatta) (m)                       Yes           N/A
  Capuchin (Cebus apella) (n)                               Yes           N/A
Monkeys (feral, in natural habitats)
  No reports of pointing to date.

Notes. N/A = insufficient data. Representative references: (a) Bates et al., 1987; (b) Baron-Cohen, Cox, Baird, Swettenham, Nightingale, Morgan, Drew, & Charman, 1996; (c) Gardner & Gardner, 1971; Kellogg & Kellogg, 1933; Savage-Rumbaugh, 1986; (d) Savage-Rumbaugh et al., 1998; (e) Bonvillian & Patterson, 1999; (f) Call & Tomasello, 1994; Furness, 1916; Miles, 1990; (g) Leavens & Hopkins, 1998; Leavens et al., 1996, 2004a; (h) Savage-Rumbaugh, Wilkerson, & Bakeman, 1977; (i) Pika, Liebal, & Tomasello, 2003; Tanner & Byrne, 1999; (j) Call & Tomasello, 1994; (k) Inoue-Nakamura & Matsuzawa, 1997; (l) Veá & Sabater-Pi, 1998; (m) Hess et al., 1993; (n) Mitchell & Anderson, 1997.

Because pointing is rare in wild ape populations, yet commonplace in captive ape populations, in the complete absence of explicit training (e.g., Leavens
& Hopkins, 1998; Leavens, Hopkins, & Bard, 1996; Leavens, Hopkins, & Thomas, 2004a), this has clear implications for our understanding of the heritability of deixis. If pointing is the phenotype of interest, then the variance in the expression of that phenotype (V_P) is
V_P = V_G + V_E + V_G×E
where V_G is the variance attributable to genotype, V_E is the variance attributable to the environment, and V_G×E is the variance attributable to the interaction between genotype and environment. It is implausible that, in the decades-long procurement of apes from the wild for display in zoos and for research and other purposes, hunters have somehow managed to select only those apes with 'pointing genes' or 'pointing gene complexes.' That is, apes in captivity are genotypically representative of apes in the wild. Therefore, V_G can be dropped from the equation and, with respect to pointing,
V_P = V_E + V_G×E
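One step in this argument is left implicit: heritability, in the quantitative-genetics sense, is the proportion of phenotypic variance attributable to genetic variance. A minimal formal statement of the inference (the symbol H^2 and the LaTeX rendering are my additions, not the author's notation):

    H^2 = \frac{V_G}{V_P} = \frac{V_G}{V_G + V_E + V_{G \times E}}, \qquad V_G \approx 0 \;\Rightarrow\; H^2 \approx 0.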
Thus, the heritability of pointing is nil; that is, the contribution of purely genetic variance to phenotypic variance is negligible. It follows that any account of the development of pointing in captive apes must invoke exogenous environmental factors. (See Danchin, Giraldeau, Valone, & Wagner, 2004, for an elaboration of the components of V_E.) Hence, if we are to understand the development of pointing in captive apes, we must understand which of the multitudinous environmental differences between wild and captive apes are most relevant. This is not a straightforward task, chiefly because, like humans, apes are extraordinarily long-lived and experience very long juvenile and adolescent epochs (Fragaszy & Bard, 1997; Tutin, 1994). This means that by the time researchers come to interact with or observe any particular ape, that ape may have experienced several decades of poorly documented life experience. If we document pointing in, say, a 40-year-old chimpanzee, what can we say about how that chimpanzee might have acquired the behavior? In truth, very little. However, it is still not universally accepted that apes point (cf. Baron-Cohen, 1999; Povinelli, Bering, & Giambrone, 2003a). Therefore, before turning to a consideration of a candidate explanation for the development of pointing in captive apes and its implications for understanding the evolution of manual deixis, considerable discussion is warranted on the question of what I mean when I assert that apes in captivity point.
Nonverbal reference defined

Because 'reference' is a term with both general and specialized meanings, a brief denotative digression is warranted. In symbolic reference, symbols are produced (e.g., "dog") which, by virtue of a shared lexicon between speaker and listener, coordinate the attention of the interactants to a conceptual entity. That is, no dog needs to be immediately present for reference to occur. Symbolic reference therefore allows communicators to transcend the immediate sensory environment (a property called 'displacement'). For some people, particularly linguists, 'reference' invokes a representational architecture in which a symbol 'stands for' a real or imaginary object, or 'referent.' For these researchers, who use the term 'reference' in this specialized sense, it is nonsensical to refer to pointing as 'referential,' because in no meaningful sense does the gesture 'stand for' the event or object to which attention is being drawn. In the interest of clarity, therefore, it is important to emphasize that in the present paper, as in our previous articles (Leavens & Hopkins, 1998; Leavens et al., 1996, 2004a), I use the more general definition of reference, which means simply "to direct attention." Manual deixis, or pointing, is thus an act of nonverbal reference (Adamson, 1996; Bates et al., 1987; Leavens et al., 2004a). It should be obvious that this use of the term 'reference' differs also from that used in studies of nonhuman primate vocal communication, in which evidence for semantic reference is offered; i.e., different vocalizations seem to be emitted in response to different types of predators or threats (e.g., Cheney & Seyfarth, 1990).
What is intentional communication?

A second preliminary consideration concerns the definition of intentional communication. For the sake of brevity, in current usage there are essentially two ways to define it. In one camp, intentional communication is defined with reference to the exercise of will: a signaler intends to influence a social partner in a certain way. Intentional communication is thus defined with reference to the motivational state of the signaler (which is unverifiable, in practice). The alternative definition emphasizes what we can objectively measure in a signaler or context. The operational definition of intentional communication that we use has been elaborated from that originally developed for the study of preverbal communication in human babies (e.g., Bates et al., 1975; Golinkoff, 1986; Sugarman, 1984; see Bard, 1992, for more extensive discussion, and see Rolfe, 1996). The first criterion is that it is used socially; that is, it requires an audience. This criterion has been met in studies of the gestural communication of orangutans (Pongo pygmaeus) and chimpanzees,
in samples ranging from two to 101 subjects (Call & Tomasello, 1994; Hostetter, Cantero, & Hopkins, 2001; Leavens et al., 1996; Leavens et al., 2004a). The second criterion is that the visual orienting behavior of the signaler is under the stimulus control of the locations of the object or event of apparent interest and the social partner (i.e., the signaler looks back and forth between the social partner and a distant event or object). This criterion has been met by virtually all reported studies of the gesture use of apes in captivity, including samples ranging from two to 115 apes (Call & Tomasello, 1994; Krause & Fouts, 1997; Leavens & Hopkins, 1998; Leavens et al., 1996; Leavens et al., 2004a). The third criterion is that the signaler exhibits putative attention-getting behavior when the social partner is not looking at the signaler. Again, this criterion has been met in studies of two to 57 apes (Krause & Fouts, 1997; Hostetter et al., 2001; Leavens, Hostetter, Wesley, & Hopkins, 2004b; Pika, Liebal, & Tomasello, 2003; Tomasello, Call, Nagell, Olguin, & Carpenter, 1994). Finally, intentional communication is defined by persistence and elaboration in the face of apparently failed communicative bids. Small-scale studies have established this criterion in chimpanzees (e.g., Menzel, 1999; Leavens et al., 1996), and forthcoming work will establish the same finding in a larger sample of chimpanzees (Leavens, Russell, & Hopkins, 2005). Hence, pointing by apes meets all the objective criteria for intentional communication originally defined with reference to the preverbal communication of human infants.
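Because these criteria are purely behavioral, they can be read as a conjunctive observational checklist. The sketch below is my illustration of that reading, in Python; it is not software from any of the studies cited, and the field names and the all-four decision rule are assumptions made for clarity.

    from dataclasses import dataclass

    @dataclass
    class GestureBout:
        """One observed gesture bout, coded on the four criteria above."""
        has_audience: bool            # criterion 1: produced socially, only with an audience
        gaze_alternation: bool        # criterion 2: looks back and forth between partner and referent
        attention_getting: bool       # criterion 3: attention-getters used when partner is not looking
        persists_or_elaborates: bool  # criterion 4: persistence/elaboration after failed bids

    def is_intentional(bout: GestureBout) -> bool:
        # A bout counts as intentional communication only when all four
        # operational criteria are satisfied.
        return (bout.has_audience and bout.gaze_alternation
                and bout.attention_getting and bout.persists_or_elaborates)

    # Example: gaze alternation without persistence fails the conjunctive test.
    print(is_intentional(GestureBout(True, True, True, False)))  # -> False

Note that criteria 3 and 4 describe how a bout unfolds over time; coding them as simple booleans is a simplification of the event-level coding used in the studies cited above.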
What is a point?

A third preliminary consideration concerns the structure of pointing. Povinelli and Davis (1994) suggested that anatomical differences between apes and humans account for alleged species differences in the shape of the pointing gesture. But we know that chimpanzees who have received language training tend to point relatively frequently with the index finger (e.g., Krause & Fouts, 1997; Menzel, 1999; reviewed by Krause, 1997; Leavens & Hopkins, 1999). In contrast, non-language-trained apes tend to point to objects with the whole hand, with all fingers extended, though there are individual differences, and some pointing with the index finger is exhibited by some language-naive chimpanzees in virtually all of our studies (e.g., Leavens & Hopkins, 1998; Leavens et al., 1996, 2004a; see Call & Tomasello, 1994, for similar findings with orangutans). Whether language training directly or only incidentally influences the number of fingers extended while pointing is currently an open question (Call & Tomasello, 1994; Krause & Fouts, 1997; Leavens & Hopkins, 1999).
There is evidence that pointing with the whole hand serves a different function for young humans from pointing with the index finger: Butterworth (e.g., 2003; Franco & Butterworth, 1996) suggested that pointing with the whole hand serves to request objects or actions on objects, whereas pointing with the index finger serves to "comment" upon something in the world. In a cross-sectional study, Franco and Butterworth (1996) found that whole-handed gestures did not change in relative frequency from 12 to 18 months of age, whereas the incidence of index-finger pointing increased dramatically over the same age range. Thus, in that study, only pointing with the index finger seemed subject to developmental change in frequency of use. Blake, O'Rourke, and Borzellino (1994) reported that no fewer than 87% of the gestures exhibited by one-year-old human infants to out-of-reach food comprised what they called 'reach-outs' (extension of all fingers of the hand), again suggesting a requesting function for whole-handed gestures. Iverson and Goldin-Meadow (1997, 2001) reported that when sighted children from 9 to 18 years of age were blindfolded and required to give verbal directions or re-tell a story, they exhibited strikingly less pointing with the index finger, compared to sighted children who were not blindfolded, and exhibited relatively more pointing with the whole hand. Wilkins (2003) has noted the cross-cultural variability with which people indicate distant objects. Taken together, these findings suggest that the form of pointing in humans is sensitive to contextual manipulations. Thus, if some apes (and humans) exhibit an overwhelming reliance on the index finger for pointing, and if others seem to prefer to indicate distant objects with their whole hands extended, then on what basis can anatomical differences between human and chimpanzee hands (and there are many such differences) be invoked to account for differences in the structure of pointing?
What does pointing do?

A fourth preliminary consideration concerns the function of pointing. In human developmental research it has been well established that human infants, after approximately one year of age, point with two distinctly different apparent goals. Sometimes, they point to objects in requestive contexts — this is frequently referred to as protoimperative or imperative pointing (e.g., Bates et al., 1975; Baron-Cohen, 1999). Protoimperative gestures seem to function to request others to act on the world in some way, for example, to deliver otherwise unreachable food or toys. On other occasions, children seem to point as though merely sharing attention to some distant object or event with a social partner is the end in itself. This latter kind of gesture is typically referred to as protodeclarative or, simply,
declarative (e.g., Bates et al., 1975; Baron-Cohen, 1999). These terms were used by Bates and her colleagues to describe the preverbal communication of infants. Imperative speech serves to demand or request things of a social partner, hence Bates et al. (1975) termed apparent requests by preverbal humans "protoimperatives," suggesting continuity in humans between preverbal and later verbal requests. Correspondingly, declarative speech serves to comment upon the world, and preverbal communication with the same apparent goal was termed "protodeclarative," again implying continuity between preverbal and later verbal commenting. Because it is not coherent to write of "proto-" imperatives or declaratives in animals who will never exhibit symbolic communication (Leavens & Hopkins, 1998), I will refer to apparent nonverbal requests as "imperatives" and apparent bids to establish shared attention to some distant event or object as "declaratives." More recently, these terms have been used in ways which imply that the distinction between imperative and declarative communication may mark a human developmental transition to a nascent theory of mind (e.g., Baron-Cohen, 1999; Legerstee & Barillas, 2003; Tomasello, 1999). According to this perspective, declarative communication implies that the signaler is attempting to manipulate another's state of mind, implying further that the signaler recognizes, at some level, that their social partners have perspectives and mental contents which differ from the signaler's own. Imperative communication, on the other hand, implies only that the signaler is attempting to manipulate a social partner's behavior. It is empirically true that apparent preverbal requests, or imperative gestures, do seem to develop in humans prior to apparent declarative gestures (e.g., Bates et al., 1987). Hence, pointing to share attention, or declarative pointing, may index greater maturity and sophistication in the communication of developing infants near the end of the first year of life. The implication typically drawn for comparative psychology is that, because apes allegedly do not gesture declaratively, but only imperatively (e.g., Baron-Cohen, 1999; Butterworth, 2001; Povinelli, Theall, Reaux, & Dunphy-Lelii, 2003b), they do not recognize the mentality of their social partners. There are at least four grounds on which this generalization can be questioned. First, there are several reports of apparent declarative pointing by apes. Savage-Rumbaugh, Shanker, and Taylor (1998) wrote of Matata, a female bonobo (mother of Kanzi, Pan paniscus): "when she heard unusual sounds in the forest, she would direct my [Savage-Rumbaugh's] attention toward them by looking and gesturing in that direction" (p. 11). Miles (1990) described several instances of apparent declarative pointing by a language-trained orangutan, Chantek. Ape language researchers report numerous allegedly declarative acts by apes using sign language or other non-vocal languages (e.g., Gardner & Gardner, 1971; Miles, 1990;
Savage-Rumbaugh, 1986; Savage-Rumbaugh et al., 1998). Strikingly, the only published report of pointing by a wild bonobo appears to be a quintessentially declarative gesture: this bonobo pointed to the location of (not so very well) hidden observers and alternated his gaze between these human observers and the rest of his troop, following behind him (Veá & Sabater-Pi, 1998). With the exception of the report by Veá and Sabater-Pi (1998), what differentiates these particular apes from other apes in captivity is that they have experienced unusually close emotional bonding with human caregivers, usually (but not always) in the context of language training. Although reports of apparently declarative communication are relatively scarce, so are rearing histories in which captive apes experience intensely close emotional bonding with trained human observers. Thus, a substantial proportion of those few apes who experience these unusually intimate and emotionally rich relationships with human caregivers also exhibit declarative pointing. Hence, although the evidence for declarative pointing in apes is based on very small samples and relatively little systematic study, because these behaviors are so commonly reported in these special populations I will tentatively accept the evidence at face value, acknowledging that future research in this domain is warranted. The relevance of emotionality to understanding declarative gestures is highlighted by the second basis for doubting the human species-specificity of declarative gestural communication: humans who have experienced profound early social deprivation also exhibit deficits in communication at rates far above those seen in the general population (Rutter, Andersen-Wood, Beckett, Bredenkamp, Castle, Groothues et al., 1999; see also Hobson, 2002; Hobson & Bishop, 2003). What these findings suggest is that what we conceive of as normal human communicative development depends in no little part on the quality of babies' early emotional bonding. If deprivation can adversely influence communicative development in humans, then it is not implausible to suggest that the kinds of institutional rearing conditions experienced by most captive apes would adversely influence their motivation to share attention with humans. It is at least plausible that the apparent paucity of observations of declarative gesturing by apes is attributable, in part, to a rearing-history influence of deprivation on motivation, rather than to a primary cognitive deficit. Third, imperative communication about distant objects is a more behaviorally complex activity than is declarative communication, traditionally defined. The usual portrayal of a declarative act (e.g., "look at that") goes something like the following: a signaler captures the attention of a social partner and re-directs the partner's attention to some distant object or event. Imperative gestures (e.g., "give me that") require the further elaboration that the signaler expects the observer to manipulate the world in some way. In early human infancy and in chimpanzee
communication, what action is expected is often signaled by context, in relation to the specific interactional histories of the individuals involved; that is, meaning seems to be co-constructed over a history of interactions between specific individuals (e.g., Shanker & King, 2002; Tomasello, 1999; Tomasello & Call, 1997; Tomasello et al., 1994). Hence, from a strictly behavioral perspective, imperative nonverbal communication subsumes the dynamic mechanics of declarative nonverbal communication and requires a bit more elaboration. If, on the other hand, signalers exhibit declarative communication with some expectation that there will be a response from the social partner over and above the mere contemplation of the distant object or event (e.g., Brinck, 2001; Liszkowski, Carpenter, Henning, Striano, & Tomasello, 2004), then both imperative and declarative acts reduce to the same level of behavioral complexity. Both are triadic, and they differ only with respect to the putative ends ("give me that" vs. "engage with me"), with essentially no meaningful difference in the cognitive prerequisites for these two kinds of communicative act (Moore & Corkum, 1994). Finally, as also noted by Povinelli et al. (2003a), we have no observational or experimental evidence that young human babies who point, apparently declaratively, also discriminate or recognize mental states in their social partners (cf. Moore & Corkum, 1994). Some writers argue that, because declarative pointing is accompanied by gaze alternation between a social partner and a distant object or event of interest, this implies awareness on the part of the baby that others have attentional states which differ from their own (e.g., Franco & Butterworth, 1996; Tomasello, 1995). Adolescent and adult chimpanzees who gesture also exhibit concomitant gaze alternation between unreachable food and human experimenters during 80% to 100% of their gestures, which is a much higher rate than that displayed by human infants before about two years of age (Leavens & Hopkins, 1998; Leavens et al., 2004a; reviewed by Leavens & Hopkins, 1999). If visual monitoring of a social partner is a 'smoking gun' that implicates the awareness of the signaler that the social partner is a mental being, then either (a) chimpanzees also recognize mental states or (b) some compelling argument has to be made that gaze alternation implicates nascent mental state reasoning in human children, but not in other animals (e.g., Povinelli et al., 2003a). Previously, we argued that gaze alternation implicated mental state reasoning in both humans and chimpanzees (Leavens et al., 1996). Since that time, we have come to the view that gaze alternation accompanying gestural behavior does not implicate mental state awareness in any species, including humans (Leavens et al., 2004a), although it is consistent with such an interpretation (e.g., Tomasello, 1995). We cannot directly measure the hypothetical motivational or volitional components of communicative behavior in any species, including humans, independently of their overt behavior (Bergmann,
1962; Leavens, 2002; Leavens et al., 2004b), including verbal behavior. Whether this is a mere technical limitation that will be overcome with advances in medical imaging technology (cf. Bergmann, 1962) or an indictment of the currently widespread assumption that folk psychologies accurately reflect psychological processes is an open question (Leavens, 2002; Leavens et al., 2004a,b; cf. Thompson, 1997). Hence, the empirical fact that both human infants and adolescent and adult chimpanzees frequently accompany their manual gestures with successive visual orienting between objects or events of apparent interest and their social partners, in both declarative and imperative contexts, does not uniquely implicate the possession by the signaler of abstract representations of their social partners’ mental functioning (see also Brinck, 2001).
Environmental correlates of pointing in apes and humans

Which, among the many environmental factors that differ between wild and captive apes, might account for the ubiquity of pointing in captive populations? We may be decades away from a truly inductive approach to this question. This is because we lack sufficient experimental control over the pre-experimental life histories of chimpanzees, and gaining adequate experimental control would take extraordinary resources and time not heretofore deployed in the study of the development of manual gestures in apes. As noted above, there are striking differences in how language-trained apes and non-language-trained apes point, most obviously in the former's frequent use of the index finger and apparent declarative behavior. Hence, because an inductive approach to this question is not feasible at the current level of empirical knowledge, a deductive approach is necessitated by the paucity of data. Call and Tomasello (1996) outlined a hierarchy of 'enculturation' in which captive apes can be categorized according to their degree of intimacy with their human caregivers. Some captive apes, particularly those raised in biomedical research institutions, might experience as little as four minutes per day of positive face-to-face interaction with humans (Bard, unpublished data), whereas others, particularly language-trained and home-reared apes, experience frequent daily, intense, affect-laden interactions with humans (e.g., Kellogg & Kellogg, 1933). Thus, in terms of rearing histories, it is naive to characterize any particular captive ape population as being representative either of all captive apes or, worse, of all apes of that species. Examples are legion in which researchers, having studied a chimpanzee, or a handful of chimpanzees, subsequently expound upon the behavior of 'The Chimpanzee,' as though their particular subjects, with their particular rearing histories, were meaningfully representative of the species.
Table 2. Pointing in apes and humans with respect to three environmental variables

                             Barrier?   History of   Emotional       Imperative   Declarative
                                        delivery?    responsivity?   pointing?    pointing?
Apes
  Wild                       No         No           Yes             Rare         Rare
  Captive (institutional)    Yes        Yes          No              Yes          Rare
  Captive (home-reared)      Yes        Yes          Yes             Yes          Yes (a)
Humans (6–15 months)
  Impoverished               Yes        Yes          No              Yes          Delayed
  Typical Western            Yes        Yes          Yes             Yes          Yes

(a) Although there are relatively few observations of home-reared or language-trained apes exhibiting protodeclarative behaviors, weight is given here to the fact that these few observations constitute a very large fraction of the relatively few apes who have experienced these unusual rearing histories.
However, because apes subject to the entire range of possible rearing histories in captivity exhibit pointing in the absence of explicit training, it is not unreasonable to consider possible similarities across the range of captive rearing conditions for clues to the advent of pointing. As a first approximation, I will consider three factors or dimensions of life experience: barriers, history of delivery, and emotional responsiveness (see Table 2). The term barriers refers to obstacles to free movement; these can be exogenous barriers, such as cage mesh, or endogenous barriers, such as locomotor immaturity (Leavens et al., 1996). The effect of these endogenous and exogenous barriers on communicative development is that, in the presence of desirable but unreachable objects, organisms are put into a problem space that is not characteristic of the natural habitats of wild apes. Chimpanzees, for example, exhibit independent quadrupedal locomotion by 4 to 5 months of age (van Lawick-Goodall, 1968). In captivity, but not in the wild, apes face a frequent problem in which desirable objects are visible, but unreachable, due to intervening cage mesh or bars. Human children, who do not develop mature bipedal locomotion until a year of age, and who are frequently restrained in high chairs, cribs, and the like, face a very similar problem space, with both endogenous and exogenous factors constraining them from directly attaining objects of interest. When captive apes and humans face these barriers, they also frequently experience circumstances in which caregivers deliver items to them. To the degree that delivery is contingent on the signaling behavior of the ape or human infant, a communicative means becomes established through interaction, or ontogenetic ritualization (see, e.g., Tomasello, 1999; Tomasello & Call, 1997). If barriers coupled with histories of delivery can account for the development of pointing, then why don't human infants tend to start pointing earlier than the
10–12-month average age of pointing onset? In Piagetian terms, this would be explained by the pattern of development of coordinated secondary circular reactions (Stage 4 in the sensorimotor period), the ability to relate means (caregiver) to specific ends (unreachable objects), which develops from 8 to 12 months of age (cf. Sugarman, 1984). Harding and Golinkoff (1979) found relationships between intentional vocalizations and Stage 5 sensorimotor intelligence, but 31% of the children adjudged to be at Stage 4 with respect to the object concept and 36% of the children adjudged to be transitional between Stage 4 and Stage 5 in terms of their understanding of physical causality also exhibited intentional vocalizations. Harding and Golinkoff therefore argued that "attainment of a specific level of development of the object concept does not appear either to be necessary or sufficient for the transition into [intentional communication]" (1979, p. 37). Sugarman (1984) reported that children in Stage 4 began to exhibit "coordinated person-object interaction" at 8–10 months of age. Bates, Thal, and Marchman (1991) noted that tool use, causal understanding, and deictic gestures all develop in roughly the same epoch, the 9- to 10-month period (see their Table 2.1). Hence, across a variety of human developmental studies relating communication to sensorimotor cognition, there is striking temporal congruity between the age at onset of intentional communication and the display of late Stage 4/early Stage 5 sensorimotor intelligence. In feral apes, retrieval of distant objects by caregivers is obviated by infant apes' ability to independently locomote to within reach of objects of interest, and this locomotor independence precedes the cognitive milestone of coordinated secondary circular reactions in apes (e.g., Gibson, 1996; Parker, 1999; Potì & Spinozzi, 1994). In the case of human infants, the precipitating condition of prolonged confinement extends through the latter half of the first year of life, when the ability to co-ordinate actions on objects with actions on social agents first develops. Hence, heterochrony, or a change in the timing of development, creates a problem space for human infants which wild apes do not experience, or experience only rarely. As noted above, captive apes experience this problem space frequently (e.g., Leavens et al., 1996). Two observations of pointing by captive apes for other apes illustrate these points (and, further, serve to refute incorrect claims that apes do not point amongst themselves, e.g., Butterworth, 2001, 2003; Povinelli et al., 2003a). Savage-Rumbaugh (1986) reported 37 instances in which Sherman and Austin, two language-trained chimpanzees, pointed in communication between themselves. What distinguishes Sherman and Austin from most other captive apes is that they were explicitly raised in a food-sharing culture; that is, they were trained to share and to take turns from an early age. Hence, they were often placed in experimental circumstances in which their training required them to await the actions of the other. In these circumstances, they frequently pointed, apparently to draw the
attention of the other to the correct response, or to items of fallen food. Both the inhibition to act directly (barrier) and the history of delivery (food-sharing) were explicitly trained. The second observation, by de Waal (1982) at the Arnhem Zoo, can be summarized as follows: two chimpanzee juveniles were playing together when the play descended into an angry brawl. One mother, Tepel, prodded the matriarch of the group, Mama, who was napping nearby, and then pointed in the direction of the fighting juveniles. Mama subsequently waded between the antagonists and separated them. De Waal's interpretation of this event was that because Tepel feared reprisal from the mother of the other juvenile, a female named Jimmie, her fear of reprisal (barrier) prevented her from directly intervening in the fight, so she enlisted Mama's assistance (Mama being the dominant female, hence not subject to negative consequences). Tepel thus experienced the problem space in which a desired outcome required the capture and re-direction of the attention of an ally. According to this argument, then, imperative pointing develops in the context of barriers to direct retrieval of desirable objects, given histories of delivery by caregivers. Why, then, don't captive monkeys, who experience similar problem spaces, also point? In fact, some monkeys apparently do spontaneously point (Hess, Novak, & Povinelli, 1993; Mitchell & Anderson, 1997), but this seems to be much rarer than pointing by apes. Because the impact of barriers on the ability of humans, apes, and monkeys to act on their environments is similar, more research into the communicative interactions between captive monkeys and their human caregivers is warranted. A number of studies that explicitly trained pointing in monkeys have noted that this training seems to lead to some generalized facility in other domains of social cognition, including comprehension of pointing, the production of declarative pointing, and imitation (Blaschke & Ettlinger, 1987; Kumashiro, Ishibashi, Itakura, & Iriki, 2002; Kumashiro, Ishibashi, Uchiyama, Itakura, Murata, & Iriki, 2003). Thus, on the one hand, it would be premature, given the paucity of data on the subject, to conclude that there is a fundamental difference between apes and monkeys in their capacities for manual deixis, whereas, on the other hand, there are very few reports of spontaneous pointing by monkeys (two, to my knowledge: Hess et al., 1993; Mitchell & Anderson, 1997). The evidence tentatively suggests that once monkeys have a certain competence in following and manipulating attention in humans, this may facilitate performance in other domains of social cognition. This may also be true for both apes and humans. Perusal of Table 2 reveals that captive apes who exhibit apparently declarative pointing are distinguished by rearing histories of close emotional bonding with, and emotional responsiveness by, human caregivers. Examples of humans who experience profound early social deprivation are rare, but Rutter et al. (1999) described profound communicative deficits in their sample of 111 orphans raised
in such deprived circumstances (see Hobson, 2002, and Hobson & Bishop, 2003, for extended discussions of the impact of social factors on communicative development in humans). The suggestion put forward here is that unless apes or human infants experience rearing histories in which the affective exchange that accompanies shared attention to distant objects (gleeful vocalizations, smiling, hugs, etc.) becomes reinforcing, through experience, the motivational basis for exhibiting declarative gestures is undermined. Thus, I hypothesize that gesturing imperatively may provide a foundation for the later generalization of pointing to declarative contexts; that is, gesturing to get somebody to act on objects provides a behavioral template for later gesturing to get somebody to engage in positive, shared emotional states. Because wild apes virtually never experience a problem space conducive to the development of imperative pointing, this obviates its later generalization to contexts eliciting emotional engagement. The 'driving force' for the development of declarative communication being offered here is the motivational basis for sharing attention, and this is seen as being subject to the emotional consequences of sharing attention to distant objects or events with another. According to this view, naming behavior, which is a frequent activity in the lives of human infants and language-trained apes, and which involves shared attention to distant objects, is accompanied by high rates of positive affective signaling on the part of the caregiver. Joint attention is thus socialized, and the reinforcing effects of caregivers' smiles, gleeful vocalizations, etc., are manifested in no little part on a foundation of emotional exchanges. In other words, for apes and humans in these social circumstances, affective exchange becomes desirable as an end in itself (see, e.g., Adamson, 1996; Adamson & Bakeman, 1985; Gómez, 1998; Moore & Corkum, 1994).
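The hypothesis developed over the last two sections can be restated as two conjunctive predictions, one per pointing column of Table 2. The toy encoding below is my paraphrase of that logic, not a model proposed by the author; it simply shows that the two rules reproduce the qualitative pattern of Table 2 (reading 'Rare' and 'Delayed' as predicted absences).

    def predicts_imperative(barrier: bool, delivery: bool) -> bool:
        # Imperative pointing: barriers to direct retrieval plus a
        # history of delivery by caregivers.
        return barrier and delivery

    def predicts_declarative(barrier: bool, delivery: bool, responsivity: bool) -> bool:
        # Declarative pointing: the imperative template generalizes,
        # given emotionally responsive caregivers.
        return predicts_imperative(barrier, delivery) and responsivity

    # (barrier, history of delivery, emotional responsivity), as in Table 2
    settings = {
        "Apes, wild":                  (False, False, True),
        "Apes, captive institutional": (True,  True,  False),
        "Apes, captive home-reared":   (True,  True,  True),
        "Humans, impoverished":        (True,  True,  False),
        "Humans, typical Western":     (True,  True,  True),
    }
    for group, (b, d, r) in settings.items():
        print(group, predicts_imperative(b, d), predicts_declarative(b, d, r))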
Implications for the evolution of manual deixis

Given that similar problem-solving capabilities, particularly the ability to use an object to obtain otherwise unreachable items, develop in most humans and in both wild and captive apes (see, e.g., Bard, 1990; Parker & Gibson, 1977), manual deixis can be seen as a problem-solving behavioral adaptation to the ubiquitous problem space posed for hominids once the development of independent locomotion became so protracted that it extended into stages of infancy in which agent-object coordinative skills were simultaneously developing. Hence, manual deixis became a human epigenetic consequence of the adaptation to bipedalism. When, in hominid evolution, bipedalism became an obligate, rather than facultative, mode of locomotion is subject to considerable current debate, with controversial
claims for the first origins of bipedal locomotion in excess of 6 million years ago (Senut, Pickford, Gommery, Mein, Cheboi, & Coppens, 2001). From the standpoint of human linguistic evolution, because joint attention is foundational to such early precursors of language use as the naming of objects (cf. Butterworth, 2003; Baldwin, 1995), the following evolutionary scenario assumes some plausibility under a loosely recapitulationist view. First, the period of locomotor immaturity extended, sometime between 6 and 4 million years ago, into later stages of cognitive development, particularly coordinated secondary circular reactions and tertiary circular reactions. Infants during this time would increasingly have been forced into tactics of manipulating their caregivers by their inherent inability to retrieve objects for themselves. Then, between 4 and 2.5 million years ago, as the hominid ontogenetic environment became ever more rich in artifacts, basic lexicons developed. The assumption here is that there is a rubicon of artifactual complexity above which efficient communication becomes subject to selection (see, e.g., Tomasello, 1999). (That is, some means of communicating "No, not the chopper, the spear lying next to the chopper" becomes selected for as the complexity of material culture increases beyond some currently ill-defined minimum.) As hypothesized by Corballis (2002), this may initially have been in the visual domain, through iconic gestures. However, in part because chimpanzees in captivity use vocal signals tactically in attention-getting functional contexts (Hostetter et al., 2001; Leavens et al., 2004b), I have argued that the emergence of language was probably multimodal (visual and vocal) from its inception, involving simultaneous use of vocal and gestural components (Leavens, 2003; see also Gibson, 1996; Lock, 1983). It is manual deixis, the ability to capture and redirect the visual attention of a social partner to some specific entity through manual gestures, that is shared by humans and their nearest living relatives, given certain commonalities in their social environments, including exposure to the problem space in which manipulation of others is the only viable solution to a problem (as in the retrieval of otherwise unreachable objects). To summarize, because apes point in captivity, and because they don't require explicit training to do this, pointing is not necessarily derived from the neurobiological or cognitive adaptations for symbolic communication in the human lineage. The epigenetic scenario sketched here, speculative though it is, does account for the near-ubiquity of pointing in captive apes and in humans (but see Wilkins, 2003), as well as its apparent scarcity in wild ape populations.
Acknowledgement

I would like to thank the following people for discussion of issues raised in this manuscript: Kim A. Bard, University of Portsmouth; the late George Butterworth, University of Sussex; Dorothy M. Fragaszy, University of Georgia; Fabia Franco, Middlesex University; William D. Hopkins, Yerkes National Primate Research Center; Jana Iverson, University of Missouri; Mark A. Krause, University of Portland; Chris Moore, Dalhousie University; Ed Mulligan, University of Georgia; Roger Thomas, University of Georgia; Nicholas S. Thompson, Clark University; an anonymous reviewer; and the other participants in the Vocalize to Localize conference, University of Grenoble, January 30–February 2, 2003.
References

Adamson, L. R. (1996). Communication development during infancy. Boulder, CO: Westview Press.
Adamson, L. R., & Bakeman, R. (1985). Affect and attention: Infants observed with mothers and peers. Child Development, 56, 582–593.
Baldwin, D. A. (1995). Understanding the link between joint attention and language. In C. Moore & P. J. Dunham (Eds.), Joint attention: Its origins and role in development (pp. 131–158). Hillsdale, NJ: Lawrence Erlbaum.
Bard, K. A. (1990). "Social tool use" by free-ranging orangutans: A Piagetian and developmental perspective on the manipulation of an animate object. In S. T. Parker & K. R. Gibson (Eds.), "Language" and intelligence in monkeys and apes: Comparative developmental perspectives (pp. 356–378). Cambridge: Cambridge University Press.
Bard, K. A. (1992). Intentional behavior and intentional communication in young free-ranging orangutans. Child Development, 62, 1186–1197.
Baron-Cohen, S. (1999). The evolution of a theory of mind. In M. C. Corballis & S. E. G. Lea (Eds.), The descent of mind: Psychological perspectives on hominid evolution (pp. 261–277). Oxford: Oxford University Press.
Baron-Cohen, S., Cox, A., Baird, G., Swettenham, J., Nightingale, N., Morgan, K., Drew, A., & Charman, T. (1996). Psychological markers of autism at 18 months of age in a large population. British Journal of Psychiatry, 168, 158–163.
Bates, E., Camaioni, L., & Volterra, V. (1975). Performatives prior to speech. Merrill-Palmer Quarterly, 21, 205–226.
Bates, E., O'Connell, B., & Shore, C. (1987). Language and communication in infancy. In J. Osofsky (Ed.), Handbook of infant development (pp. 149–203). New York: Wiley.
Bates, E., Thal, D., & Marchman, V. (1991). Symbols and syntax: A Darwinian approach to language development. In N. A. Krasnegor, D. M. Rumbaugh, R. L. Schiefelbusch, & M. Studdert-Kennedy (Eds.), Biological and behavioral determinants of language development (pp. 29–65). Hillsdale, NJ: Erlbaum.
Bergmann, G. (1962). Purpose, function, scientific explanation. Acta Sociologica, 5, 225–238.
Blake, J., O'Rourke, P., & Borzellino, G. (1994). Form and function in the development of pointing and reaching gestures. Infant Behavior and Development, 17, 195–203.
Blaschke, M., & Ettlinger, G. (1987). Pointing as an act of social communication by monkeys. Animal Behaviour, 35, 1520–1523.
Bonvillian, J. D., & Patterson, F. G. P. (1999). Early sign-language acquisition: Comparisons between children and gorillas. In S. T. Parker, R. W. Mitchell, & H. Lyn Miles (Eds.), The mentalities of gorillas and orangutans: Comparative perspectives (pp. 240–264). Cambridge, UK: Cambridge University Press.
Brinck, I. (2001). Attention and the evolution of intentional communication. Pragmatics & Cognition, 9, 255–272.
Butterworth, G. (2001). Joint visual attention in infancy. In J. G. Bremner & A. Fogel (Eds.), Blackwell handbook of infant development (pp. 213–240). Hove: Blackwell.
Butterworth, G. (2003). Pointing is the royal road to language for babies. In S. Kita (Ed.), Pointing: Where language, culture, and cognition meet (pp. 9–33). Mahwah, NJ: Erlbaum.
Butterworth, G., & Grover, L. (1988). The origins of referential communication in human infancy. In L. Weiskrantz (Ed.), Thought without language (pp. 5–24). Oxford: Clarendon Press.
Call, J., & Tomasello, M. (1994). Production and comprehension of referential pointing by orangutans (Pongo pygmaeus). Journal of Comparative Psychology, 108, 307–317.
Call, J., & Tomasello, M. (1996). The effect of humans on the cognitive development of apes. In A. E. Russon, K. A. Bard, & S. T. Parker (Eds.), Reaching into thought: The minds of the great apes (pp. 371–403). Cambridge: Cambridge University Press.
Carpenter, M., Nagell, K., & Tomasello, M. (1998). Social cognition, joint attention, and communicative competence from 9 to 15 months of age. Monographs of the Society for Research in Child Development, 63.
Cheney, D. L., & Seyfarth, R. M. (1990). How monkeys see the world: Inside the mind of another species. Chicago: University of Chicago Press.
Corballis, M. C. (2002). From hand to mouth: The origins of language. Princeton, NJ: Princeton University Press.
Danchin, É., Giraldeau, L.-A., Valone, T. J., & Wagner, R. H. (2004). Public information: From nosy neighbors to cultural evolution. Science, 305, 487–491.
Fragaszy, D. M., & Bard, K. A. (1997). Comparisons of development and life history in Pan and Cebus. International Journal of Primatology, 18, 683–701.
Franco, F., & Butterworth, G. (1996). Pointing and social awareness: Declaring and requesting in the second year. Journal of Child Language, 23, 307–336.
Furness, W. H. (1916). Observations on the mentality of chimpanzees and orang-utans. Proceedings of the American Philosophical Society, 55, 281–290.
Gardner, B. T., & Gardner, R. A. (1971). Two-way communication with an infant chimpanzee. In A. M. Schrier & F. Stollnitz (Eds.), Behavior of nonhuman primates: Modern research trends, Vol. 4 (pp. 117–183). New York: Academic Press.
Gibson, K. (1996). The ontogeny and evolution of the brain, cognition, and language. In A. Lock & C. R. Peters (Eds.), Handbook of human symbolic evolution (pp. 407–431). Hove, UK: Blackwell.
Golinkoff, R. M. (1986). 'I beg your pardon?': The preverbal negotiation of failed messages. Journal of Child Language, 13, 455–476.
Gómez, J.-C. (1998). Do concepts of intersubjectivity apply to non-human primates? In S. Bråten (Ed.), Intersubjective communication and emotion in early ontogeny (pp. 245–259). Cambridge, UK: Cambridge University Press.
Harding, C. G., & Golinkoff, R. M. (1979). The origins of intentional vocalizations in prelinguistic infants. Child Development, 50, 33–40.
Hess, J., Novak, M. A., & Povinelli, D. J. (1993). 'Natural pointing' in a rhesus monkey, but no evidence of empathy. Animal Behaviour, 46, 1023–1025.
Hobson, R. P. (2002). The cradle of thought: Exploring the origins of thinking. London: Macmillan.
Hobson, R. P., & Bishop, M. (2003). The pathogenesis of autism: Insights from congenital blindness. Philosophical Transactions of the Royal Society of London, Series B, 358, 335–344.
Hostetter, A. B., Cantero, M., & Hopkins, W. D. (2001). Differential use of vocal and gestural communication by chimpanzees (Pan troglodytes) in response to the attentional status of a human (Homo sapiens). Journal of Comparative Psychology, 115, 337–343.
Inoue-Nakamura, N., & Matsuzawa, T. (1997). Development of stone tool use by wild chimpanzees (Pan troglodytes). Journal of Comparative Psychology, 111, 159–173.
Iverson, J. M., & Goldin-Meadow, S. (1997). What's communication got to do with it? Gesture in children blind from birth. Developmental Psychology, 33, 453–467.
Iverson, J. M., & Goldin-Meadow, S. (2001). The resilience of gesture in talk: Gesture in blind speakers and listeners. Developmental Science, 4, 416–422.
Kellogg, W. N., & Kellogg, L. A. (1933). The ape and the child: A study of environmental influence upon early behavior. New York: McGraw-Hill.
Krause, M. A. (1997). Comparative perspectives on pointing and joint attention in children and apes. International Journal of Comparative Psychology, 10, 137–157.
Krause, M. A., & Fouts, R. S. (1997). Chimpanzee (Pan troglodytes) pointing: Hand shapes, accuracy, and the role of eye gaze. Journal of Comparative Psychology, 111, 330–336.
Kumashiro, M., Ishibashi, H., Itakura, S., & Iriki, A. (2002). Bidirectional communication between a Japanese monkey and a human through eye gaze and pointing. Cahiers de Psychologie/Current Psychology of Cognition, 21, 2–32.
Kumashiro, M., Ishibashi, H., Uchiyama, Y., Itakura, S., Murata, A., & Iriki, A. (2003). Natural imitation induced by joint attention in Japanese monkeys. International Journal of Psychophysiology, 50, 81–99.
Leavens, D. A. (2002). On the public nature of communication. The Behavioral and Brain Sciences, 25, 630–631.
Leavens, D. A. (2003). Integration of visual and vocal communication: Evidence for Miocene origins. The Behavioral and Brain Sciences, 26, 232–233.
Leavens, D. A., & Hopkins, W. D. (1998). Intentional communication by chimpanzees: A cross-sectional study of the use of referential gestures. Developmental Psychology, 34, 813–822.
Leavens, D. A., & Hopkins, W. D. (1999). The whole-hand point: The structure and function of pointing from a comparative perspective. Journal of Comparative Psychology, 113, 417–425.
Leavens, D. A., Hopkins, W. D., & Bard, K. A. (1996). Indexical and referential pointing in chimpanzees (Pan troglodytes). Journal of Comparative Psychology, 110, 346–353.
Leavens, D. A., Hopkins, W. D., & Thomas, R. K. (2004a). Referential communication by chimpanzees (Pan troglodytes). Journal of Comparative Psychology, 118, 48–57.
Leavens, D. A., Hostetter, A. B., Wesley, M. J., & Hopkins, W. D. (2004b). Tactical use of unimodal and bimodal communication by chimpanzees (Pan troglodytes). Animal Behaviour, 67, 467–476.
Leavens, D. A., Russell, J. L., & Hopkins, W. D. (2005). Intentionality as measured in the persistence and elaboration of communication by chimpanzees (Pan troglodytes). Child Development, 76, 291–306.
Manual deixis in apes and humans
Legerstee, M., & Barillas, Y. (2003). Sharing attention and pointing to objects at 12 months: Is the intentional stance implied? Cognitive Development, 18, 91–110. Liszkowski, U., Carpenter, M., Henning, A., Striano, T., & Tomasello, M. (2004). Twelve-montholds point to share attention and interest. Developmental Science, 7, 297–307. Lock, A. (1983). “Recapitulation” in the ontogeny and phylogeny of language. In E. de Grolier (Ed.), Glossogenetics: The origin and evolution of language (pp. 91–114). Paris: Harwood Academic Publishers. Menzel, C. R. (1999). Unprompted recall and reporting of hidden objects by a chimpanzee (Pan troglodytes) after extended delays. Journal of Comparative Psychology, 113, 426–434. Miles, H. L. (1990). The cognitive foundations for reference in a signing orangutan. In S. T. Parker & K. R. Gibson (Eds.), “Language” and intelligence in monkeys and apes: Comparative developmental perspectives (pp. 511‑539). Cambridge: Cambridge University Press. Mitchell, R. W., & Anderson, J. R. (1997). Pointing, withholding information, and deception in capuchin monkeys (Cebus apella). Journal of Comparative Psychology, 111, 351–361. Moore, C., & Corkum, V. (1994). Social understanding at the end of the first year of life. Developmental Review, 14, 349–372. Parker, S. T. (1999). The development of social roles in the play of an infant gorilla and its relationship to sensorimotor intellectual development. In S. T. Parker, R. W. Mitchell, & H. Lyn Miles (Eds.), The mentalities of gorillas and orangutans: Comparative perspectives (pp. 367–393). Cambridge, U.K.: Cambridge University Press. Parker, S. T., & Gibson, K. R. (1977). Object manipulation, tool use and sensorimotor intelligence as feeding adaptations in Cebus monkeys and great apes. Journal of Human Evolution, 6, 623–641. Pika, S., Liebal, K., & Tomasello, M. (2003). Gestural communication in young gorillas (Gorilla gorilla): gestural repertoire, learning, and use. American Journal of Primatology, 60, 95–111. Potì, P., & Spinozzi, G. (1994). Early sensorimotor development in chimpanzees (Pan troglodytes). Journal of Comparative Psychology, 108, 93–103. Povinelli, D. J., & Davis, D. R. (1994). Differences between chimpanzees (Pan troglodytes) and humans (Homo sapiens) in the resting state of the index finger: Implications for pointing. Journal of Comparative Psychology, 108, 134–139. Povinelli, D. J., Bering, J., & Giambrone, S. (2003a). Chimpanzee ‘pointing’: Another error of the argument by analogy? In S. Kita (Ed.), Pointing: Where language, culture, and cognition meet (pp. 35–68). Hillsdale, NJ: Erlbaum. Povinelli, D. J., Theall, L. A., Reaux, J. E., & Dunphy-Lelii, S. (2003b). Chimpanzees spontaneously alter the location of their gestures to match the attentional orientation of others. Animal Behaviour, 66, 71–79. Rolfe, L. (1996). Theoretical stages in the prehistory of grammar. In A. Lock & C. R. Peters (Eds.), Handbook of human symbolic evolution (pp. 776–792). Hove, U.K.: Blackwell. Rutter, M., Anderson-Wood, L., Beckett, C., Bredenkamp, D., Castle, J., Groothues, C., Kreppner, J., Keaveney, L., Lord, C., O’Connor, T. G., & the English and Romanian Adoptees (ERA) Study Team. (1999). Quasi-autistic patterns following severe early global privation. Journal of Child Psychology, 40, 537–549. Savage-Rumbaugh, E. S. (1986). Ape language: From conditioned response to symbol. New York: Columbia University Press. Savage-Rumbaugh, E. S., Shankar, S. G., & Taylor, T. J. (1998). Apes, language, and the human mind. 
Oxford: Oxford University Press.
85
86
David A. Leavens
Savage-Rumbaugh, E. S., Wilkerson, B. J., & Bakeman, R. (1977). Spontaneous gestural communication among conspecifics in the pygmy chimpanzee (Pan paniscus). In G. H. Bourne (Ed.), Progress in Ape Research (pp. 97–116). New York: Academic Press. Senut, B., Pickford, M., Gommery, D., Mein, P., Cheboi, K., & Coppens, Y. (2001). First hominid from the Miocene (Lukeino Formation, Kenya). Comptes Rendus de l’Académie de Sciences, 332, 137–144. Shankar, S. G, & King, B. J. (2002). The emergence of a new paradigm in ape language research. The Behavioral and Brain Sciences, 25, 605–656. Sugarman, S. (1984). The development of preverbal communication: Its contribution and limits in promoting the development of language. In R. L. Scheifelbush & J. Pickar (Eds.), The acquisition of communicative competence (pp. 23‑67). Baltimore: University Park Press. Tanner, J. E., & Byrne, R. W. (1999). The development of spontaneous gestural communication in a group of zoo-living lowland gorillas. In S. T. Parker, R. W. Mitchell, & H. Lyn Miles (Eds.), The mentalities of gorillas and orangutans: Comparative perspectives (pp. 211–239). Cambridge, U.K.: Cambridge University Press. Thompson, N. S. (1997). Communication and natural design. In D. H. Owings, M. D. Beecher & N. S. Thompson (Eds.), Perspectives in Ethology, Vol. 12 (pp. 391–415). New York: Plenum. Tomasello, M. (1995). Joint attention as social cognition. In C. Moore & P. J. Dunham (Eds.), Joint attention: Its origins and role in development (pp. 103–130). Hillsdale, NJ: Lawrence Erlbaum Associates. Tomasello, M. (1999). The cultural origins of human cognition. Cambridge, MA: Harvard University Press. Tomasello, M. & Call, J. (1997). Primate cognition. Oxford: Oxford University Press. Tomasello, M., Call, J., Nagell, K., Olguin, K. & Carpenter, M. (1994). The learning and use of gestural signals by young chimpanzees: A trans-generational study. Primates, 35, 137–154. Tutin, C. E. G. (1994). Reproductive success story: Variability among chimpanzees and comparisons with gorillas. In R. W. Wrangham, W. C. McGrew, F. B. M. de Waal, & P. G. Heltne (Eds.), Chimpanzee cultures (pp. 181–193). Cambridge, MA: Harvard University Press. van Lawick-Goodall, J. (1968). The behavior of free-living chimpanzees in the Gombe Stream Reserve. Animal Behaviour Monographs, 1, 161–311. Veá, J. J. & Sabater-Pi, J. (1998). Spontaneous pointing behaviour in the wild pygmy chimpanzee (Pan paniscus). Folia Primatologica, 69, 289–290. de Waal, F. B. M. (1982). Chimpanzee politics: Power and sex among apes. New York: Harper & Row. Wilkins, D. (2003). Why pointing with the index finger is not a universal (in sociocultural and semiotic terms). In S. Kita (Ed.), Pointing: Where language, culture, and cognition meet (pp. 171–215). Hillsdale, NJ: Erlbaum.
About the author
David A. Leavens. PhD (Psychology) in 2001 from the University of Georgia (USA). Worked sporadically at the Yerkes National Primate Research Center (USA) from 1994 to 1999. Since 2000, Lecturer in Psychology and Director of the Infant Study Unit, University of Sussex (UK).
Neandertal vocal tract
Which potential for vowel acoustics?
Louis-Jean Boë, Jean-Louis Heim, Christian Abry and Pierre Badin
Université Stendhal, Institut de la Communication Parlée, Grenoble (Boë, Abry, Badin) / Musée de l’Homme, Paris (Heim)
Potential speech abilities constitute a key component in the description of the Neandertals and their relations with modern Homo sapiens. Since Lieberman & Crelin postulated in 1971 that “Neanderthal man did not have the anatomical prerequisites for producing the full range of human speech”, Neandertal speech capability has been hotly debated for over 30 years and remains a controversial question. In this study, we first question the methodology adopted by Lieberman and Crelin, and point out articulatory and acoustic flaws in their data and modeling. We then propose a general articulatory-acoustic framework for testing the acoustic consequences of the trade-off between the oral and pharyngeal cavities. Specifically, following Honda & Tiede (1998), we characterize this trade-off by a Laryngeal Height Index (LHI) corresponding to the length ratio of the pharyngeal cavity to the oral cavity. Using an anthropomorphic articulatory model controlled by lips, jaw, tongue and larynx parameters, we can generate the Maximal Vowel Space (MVS), which is a triangle in the F1/F2 plane, the three point vowels /a/, /i/, and /u/ being located at its three extremities. We sample the evolution of the position of the larynx from birth to adulthood with four different LHI values, and we show that the associated MVSs are very similar. The MVS of a given vocal tract therefore does not depend on the LHI: gestures of the tongue body, lips and jaw allow compensations for differences in the ratio between the dimensions of the oral cavity and pharynx. We then infer that the vowel space of Neandertals (with a high or low larynx) was potentially no smaller than that of a modern human, and that Neandertals could produce all the vowels of the world’s languages. Neandertals were no more vocally handicapped than newborn children are. There is therefore no reason to believe that the lowering of the larynx and a concomitant increase in pharynx size are necessary evolutionary pre-adaptations for speech. However, since our study is strictly limited to the morphological and acoustic aspects of the vocal tract, we cannot offer any definitive answer to the question of whether Neandertals could produce human speech or not.
Keywords: speech, evolution, Neandertals, morphology, vocal tract
1. A widespread but controversial theory
Lieberman & Crelin (1971), henceforth L&C, followed up by Lieberman et al. (1972) and Lieberman (1972, 1973, 1984, 1991, 1994), startled paleontologists and anthropologists, as well as the speech science community, when they inferred the shape of the vocal tract from the fossilized skull of the La Chapelle-aux-Saints Neandertal and estimated the extent of his /i/, /a/, and /u/ vowel space. The associated vowel space turned out to be highly reduced with respect to that of a modern human male. They concluded that Neandertals could not have possessed “the full range of human speech” and that this was probably the cause of their mysterious extinction about 30,000 years ago. According to L&C, in order for human speech production to become possible during the course of evolution, it was necessary that the skull base be sufficiently flexed (Laitman, 1983; Laitman et al., 1979; Lieberman & McCarthy, 1999), and that the larynx descend to enlarge the pharyngeal cavity, for the vowel space to be large enough to realize the contrasts observed in current human vowels. In their revolutionary proposal, L&C grouped chimpanzees, Neandertals, and human newborns in the same class, all having a short pharyngeal cavity relative to the oral cavity. According to L&C, this articulatory configuration prevented them from producing the full range of human speech, namely the /i/, /a/ and /u/ maximal vowel contrasts. Indeed these contrasts are present in the vast majority of the world’s languages (Crothers, 1978; Maddieson, 1986, 1991; Vallée, 1994; Schwartz et al., 1997a). In spite of numerous criticisms formulated by anthropologists (Mann & Trinkaus, 1973; Carlisle & Siegel, 1974; Morris, 1974; Falk, 1975; LeMay, 1975; Burr, 1976; Houghton, 1993; Schepartz, 1993; Trinkaus & Shipman, 1993; McCarthy & D. Lieberman, 1997), the thesis proposed by Lieberman and Crelin has been presented as indisputable in numerous publications, encyclopedias, and works of reference (e.g. Chaline, 1994; Leakey, 1994; Lewin, 1984; Lumley, 1998; Reichholf, 1990; Segui & Ferrand, 2000; Shreeve, 1995; Vauclair, 1992).
In this article, we first point out that L&C used the implausible anatomical skull reconstruction established by Boule at the beginning of the 20th century. Next, we demonstrate that L&C’s methodology for inferring vocal tract shapes for vowels was inadequate and resulted in an artificially reduced acoustic vowel space. Then, based on standard knowledge of vocal tract acoustics and the associated simulation tools, we show that it is possible to produce a human-like acoustic vowel space whatever the larynx position. This implies that, in fact, even the erroneous skull reconstruction should not have prevented L&C from establishing vocal tract shapes that could produce a human-like vowel space.
1.1 An unlikely skull reconstruction
L&C employed a cast of the skull of the La Chapelle-aux-Saints Neandertal that had been reconstructed by Boule (1911–1913, 1921). At the beginning of the 20th century, Boule could rely on only a few elements to reconstruct the skull base accurately. In addition, Neandertals were then considered to be closer to the chimpanzee than to modern Man and were therefore depicted with a forward-tilted head position (Figure 1). In this context, Boule assigned a much too anterior location to the head’s center of gravity, resulting in a much too posterior position of the foramen magnum and of the basion. Much later, Heim (1986, 1989, 1990) proposed a more realistic reconstruction of this Neandertal skull, taking into account the most recent anatomical knowledge (Figure 2). In particular, he found that the Landzert angle that characterizes the skull base was 150° for Boule’s reconstruction, whereas it was 137° for his own reconstruction. This new value is more consistent with the values measured for the Gibraltar Neandertal (142°) and La Ferrassie 1 (135°; Heim, 1976). Recall that the range of the Landzert angle is 126–143° for modern Man (Heim, 1989). Heim thus showed that the shape of the skull base and its position relative to the backbone did not differ from those of modern Man, and that the Neandertal larynx was presumably located in the same position as that of modern Man.
Figure 1. The head and neck reconstruction of the La Chapelle-aux-Saints Neandertal directed by Boule and realized by the sculptor J. Durand (1921). Note the forward head tilt, close to that of a chimpanzee.
Figure 2. La Chapelle-aux-Saints Neandertal reconstructed by J.L. Heim (1984)
1.2 Unrealistic vocal tract shapes and thus an unrealistic vowel triangle
Standard knowledge of vocal tract acoustics shows that, for producing the main extreme point vowels /i a u/, whatever the anatomical size of the vocal tract and whatever the model being used, whether oversimplified or highly sophisticated:
– there is only one solution for generating a vowel with the highest third resonance, as is typical of /i/: to form a constriction tube by fronting the grooved tongue against the hard palate; as a direct consequence, the length of the back cavity will always be appropriate to produce a second resonance at a sufficiently high frequency; moreover, this configuration ensures that the resonance of the “Helmholtz resonator” formed by the back cavity and the front constriction is low enough to be typical of the first resonance of /i/;
– the two remaining point vowels can be produced with no difficulty: with two Helmholtz resonators for /u/, and with a horn-shaped open vocal tract for /a/.
All these resonance frequencies are naturally normalized relative to the frequency of vocal fold vibration (for babies, adults, men and women) in order for these vowels to be perceived with their proper acoustic qualities.
To prove that newborns and Neandertals could not produce the /i/, /a/, and /u/ vowel contrasts, L&C started from reconstructions of the supralaryngeal vocal tract. Then they measured the cross-sectional area at 0.5 cm intervals.
Figure 3. Area functions presented by L&C as the “best match” for a Neandertal trying to produce the human /i/, /a/, and /u/ contrasts with a small pharyngeal cavity (solid lines), plotted as cross-sectional area (cm²) against distance from the lips to the glottis (cm). Superimposed are the corresponding area functions from Fant (1960, p. 115) (staircase).
These measurements gave a “neutral area function” that they “perturbed towards area functions that would be reasonable if a newborn or a Neandertal vocal tract attempted to produce the full range of human vowels” (Lieberman & Crelin, 1971), namely the human /i/, /a/, and /u/. In this attempt, they worked from the area functions proposed by Fant (1960). Figure 3 presents Fant’s area functions and those that L&C proposed by decreasing the volume and length of the pharyngeal cavity. Note that, by limiting the cross-sectional area of the first 6 centimeters of the vocal tract to an area close to that of the larynx tube, L&C assumed that the tongue root of newborns and Neandertals could not move forward. Note also that they did not take into account Neandertal prognathism and accordingly did not increase the palatal distance: the resulting overall vocal tract lengths were therefore limited to a range from 14.2 cm to 16 cm.
We calculated the formant frequencies associated with the three area functions proposed by L&C using a frequency domain simulation (Badin & Fant, 1984) (see Table 1). Figure 4 displays, in the F1/F2 space, the L&C Neandertal /i/, /a/, and /u/ triangle, as well — for reference — as the /i/, /æ/, /ɑ/, and /u/ quadrilaterals for males, females and children published by Peterson & Barney (1952). As expected, the estimated vowel triangle for Neandertal is reduced.
Table 1. Vocal tract lengths and F1–F2 values for /i a u/ proposed by L&C for Neandertal (male adult)

      L (cm)   F1 (Hz)   F2 (Hz)
/i/   14.4     524       2038
/a/   14.2     846       1650
/u/   16.0     462       807
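The principle of such a frequency-domain computation can be illustrated with a minimal lossless-tube sketch in Python. This is only an idealization of the general method, not the Badin & Fant (1984) simulation itself, which also models losses and radiation: the tract is treated as a chain of uniform tube sections, the 2×2 chain matrices of the sections are multiplied at each trial frequency, and the formants are the frequencies where the glottis-to-lips transfer function has a pole (the zeros of the matrix entry K22 under ideal closed-glottis, open-lips terminations).

```python
import numpy as np

C = 35000.0  # speed of sound in warm moist air (cm/s)

def k22(freq_hz, sections):
    """K22 entry of the chain matrix of a lossless concatenation of
    uniform tube sections, given glottis-to-lips as (area_cm2, length_cm)
    pairs. With a hard glottis and ideally open lips, the formants are
    the zeros of K22."""
    K = np.eye(2, dtype=complex)
    k = 2.0 * np.pi * freq_hz / C
    for area, length in sections:
        kl = k * length
        M = np.array([[np.cos(kl), 1j * np.sin(kl) / area],
                      [1j * area * np.sin(kl), np.cos(kl)]])
        K = K @ M
    return K[1, 1].real  # diagonal entries stay real for lossless tubes

def formants(sections, fmax=4000.0, df=2.0):
    """Scan [df, fmax] for sign changes of K22 and refine each root by
    bisection; returns the formant frequencies in Hz."""
    freqs = np.arange(df, fmax, df)
    vals = np.array([k22(f, sections) for f in freqs])
    roots = []
    for i in np.flatnonzero(vals[:-1] * vals[1:] < 0):
        lo, hi = freqs[i], freqs[i + 1]
        for _ in range(30):
            mid = 0.5 * (lo + hi)
            if k22(lo, sections) * k22(mid, sections) < 0:
                hi = mid
            else:
                lo = mid
        roots.append(round(0.5 * (lo + hi), 1))
    return roots

# Sanity check: a uniform 17.4 cm tube (area 4 cm2) gives the neutral
# quarter-wavelength resonances c/4L, 3c/4L, 5c/4L, ...
print(formants([(4.0, 17.4)]))  # approx. [502.9, 1508.6, 2514.4, 3520.1]
```

For a uniform 17.4 cm tube this returns the classic quarter-wavelength pattern near 500, 1500 and 2500 Hz, a useful sanity check before feeding in reconstructed area functions such as those of L&C.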
Figure 4. The L&C Neandertal /i/, /a/, and /u/ triangle (in black); in grey, the /i/, /æ/, /ɑ/, and /u/ quadrilaterals for adult males (upper right), adult females (middle), and children (lower left) (from Peterson & Barney, 1952).
It so happens that the vowel /a/ is the only realistic one, as it is naturally characterized by a pharyngeal constriction. As a matter of fact, its area function roughly matches the one proposed by Fant. Due to its reduced length, it is typical of a female /a/. The small dimensions of the triangle obtained by L&C are due to the fact that they modified the three area functions without taking into account basic knowledge of the acoustic theory of speech production (Fant, 1960). In fact, their strategy for establishing the area functions optimally matching /i/, /a/, and /u/ curiously resulted in the inverse effect on F1:
– for /i/: they should have increased the constriction length and decreased the constriction area (see the Helmholtz approximation below);
– for /u/: they could have increased lip protrusion and decreased lip area in order to lower F1; the excessively low value of F2 is due to an overly narrow velar constriction.
In conclusion, the area functions proposed by L&C produce vowels corresponding to /ɪ/ and /o/ rather than the extreme point vowels /i/ and /u/. We show in the next section that it is indeed possible to find area functions producing /i a u/ and corresponding to plausible vocal tracts whatever the larynx position.
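Both corrections follow from the standard low-frequency Helmholtz approximation for a back cavity of volume $V_b$ behind a constriction of area $A_c$ and length $l_c$:

\[ F_1 \approx \frac{c}{2\pi}\sqrt{\frac{A_c}{l_c\,V_b}} \]

Narrowing the constriction (smaller $A_c$) or lengthening it (larger $l_c$) lowers F1, which is precisely what the L&C area functions for /i/ and /u/ fail to exploit.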
2. A flexible articulatory-acoustic vocal tract model
The ratio between the oral and pharyngeal cavity lengths is — in addition to the overall vocal tract length — one of the important factors for globally characterizing vocal tract morphology. This is why Honda & Tiede (1998) proposed an index based on palatal distance (PD) and laryngeal height (LH). PD is the distance between the anterior nasal spine (ANS) and the posterior nasopharyngeal wall (PNW). PNW is defined as the intersection point of a standard palatal line (specified by ANS and the posterior nasal spine) with the posterior nasopharyngeal wall. LH is the distance from the arytenoid apex to the palatal line. LH therefore represents the pharyngeal cavity length and thus the vertical position of the larynx. LHI = LH/PD is thus an interesting articulatory parameter, as it reflects growth differences from birth to adulthood and gender differences, as well as differences between modern Man, Neandertal and the chimpanzee.
In order to assess the effects of this LHI factor on the vowels, we used a vocal tract model that we had already designed to study human growth. It seemed appropriate for studying the acoustic correlates of the trade-off between pharyngeal and oral cavity dimensions as specified by the LHI parameter (Boë & Maeda, 1998; Ménard & Boë, 2000; Ménard, 2002; Ménard et al., 2002; Serkhane et al., 2002).
At birth, the head of the newborn is approximately hemispherical in shape. Increases in the volume of the skull, changes in its shape, and growth of the inferior maxilla modify the relative proportions of the horizontal and vertical dimensions. The process therefore involves not a simple uniform scaling, but rather an anamorphosis in which the vertical dimension is emphasized. For the vocal tract, this phenomenon is further accentuated by the lowering of the larynx. The growth of the pharynx is therefore approximately twice as large as that of the front cavity.
Vocal tract growth simulation has thus been implemented in the Variable Linear Articulatory Model (VLAM) by Maeda (cf. Boë & Maeda, 1998). This model is based on Maeda’s anthropomorphic linear articulatory model proposed in 1989. The basic assumption of the model is that any vocal tract midsagittal contour can be decomposed into a weighted sum of linear components representing the effects of individual articulator parameters. Note in particular that any tongue shape (from the apex to the root) can be adequately described by four components consisting of an extrinsic parameter, jaw position, and of three intrinsic parameters: tongue-body position, tongue-body shape and tongue-tip position. The lip horn is represented by a uniform tube controlled by two parameters (lip height and lip protrusion) in addition to the jaw position effect. Finally, the laryngeal region is controlled by a parameter that sets the larynx height. Note that the upper incisors, the hard palate, and the posterior pharyngeal wall are represented by a fixed exterior outline. The vocal tract contours are described in a semi-polar coordinate system.
The growth process is implemented in our articulatory model by modifying the longitudinal dimension of the vocal tract according to two scaling factors applied to three regions: (1) the anterior part of the vocal tract, scaled by the first factor kmouth; (2) the pharynx, scaled by the second factor kpharynx; and (3) an intermediate region, scaled by a factor continuously interpolated between the two fixed factors. We made sure that length variations in the midsagittal dimension corresponded to data available in the literature (Kasuya et al., 1968; Goldstein, 1980; Yang & Kasuya, 1994; Story et al., 1996; White, 1998; Fitch & Giedd, 1999). In particular, we have verified (Boë et al., 2002; Ménard & Boë, 2000) that VLAM predicts vocal tract lengths compatible with the limits established by Fitch & Giedd (1999). This assessment is of great importance, as vocal tract length determines the absolute position (along the frequency axis) of the maximal acoustic vowel F1/F2/F3 space.
Our vocal tract model does not allow the recovery of the same landmarks as those used by Honda & Tiede (1998). Therefore, we defined an LHI ratio adapted to our model. To estimate the palatal distance PD, we used the incisors (instead of the anterior nasal spine) and the top of the pharyngeal wall, and larynx height LH was estimated in reference to glottis position (instead of arytenoid apex position) (Figure 5). Table 2 displays the values of the vocal tract length, larynx height, palatal distance, and larynx height index for the male adult and newborn models.
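As an illustration, the three-region scheme can be sketched as a position-dependent length scaling along the tract. The region boundaries used here (0.4–0.6 in normalized glottis-to-lips position) are assumptions chosen for illustration, not the boundaries used in VLAM; the two scale factors are estimated from Table 2 (newborn to adult: LH grows by 8.70/2.63 ≈ 3.3, PD by 8.70/4.34 ≈ 2.0).

```python
import numpy as np

def growth_scale(x, k_pharynx, k_mouth, x_lo=0.4, x_hi=0.6):
    """Length scale factor at normalized position x along the tract
    (0 = glottis, 1 = lips): k_pharynx below x_lo, k_mouth above x_hi,
    and a linear interpolation in the intermediate region. The region
    boundaries x_lo and x_hi are hypothetical."""
    if x <= x_lo:
        return k_pharynx
    if x >= x_hi:
        return k_mouth
    t = (x - x_lo) / (x_hi - x_lo)
    return (1.0 - t) * k_pharynx + t * k_mouth

# Newborn-to-adult factors estimated from Table 2: pharynx ~x3.3, mouth ~x2.0.
xs = np.linspace(0.0, 1.0, 11)
print([round(growth_scale(x, k_pharynx=3.3, k_mouth=2.0), 2) for x in xs])
# -> [3.3, 3.3, 3.3, 3.3, 3.3, 2.65, 2.0, 2.0, 2.0, 2.0, 2.0]
```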
Figure 5. The landmarks used to calculate palatal distance PD and larynx height LH, hence the larynx height index LHI = LH/PD.

Table 2. Values of vertical larynx height (LH), horizontal palatal distance (PD), larynx height index (LHI), and vocal tract length calculated with the articulatory model (VLAM)

                     LH (cm)   PD (cm)   LHI    Length (cm)
21-year-old male     8.70      8.70      1.00   17.45
0-year-old newborn   2.63      4.34      0.60   7.70
3. Maximal Vowel Space and the point vowels /i a u/
We can define the Maximal Vowel Space (MVS) of a given speaker as the n-dimensional space of the first n formants of all possible vocalic sounds that can be produced by that speaker. Obviously, the use of an articulatory model allows an exhaustive determination of the borders of that space, and in particular of the point vowels /i a u/.
This MVS represents a global acoustic characterization of a given vocal tract, and can thus be used to compare speech capabilities from birth to adulthood.
3.1 The Maximal Vowel Space and compensation phenomena
It is possible to produce the maximal F1-F2-F3 acoustic space of a model by scanning the entire input space of each articulatory control parameter, while satisfying the conditions necessary for vocalic production. The fact that all vowels are necessarily located in the F1/F2 triangle is due to basic acoustic properties of a single tract (Bonder, 1982; Boë et al., 1989). Determining the MVS in such a way is more reliable than simply using three unique examples of the point vowels /i/, /a/, and /u/, which are not guaranteed to be located at the real limits of the triangle.
When the number of control parameters is larger than the number of output parameters, a given output can then be generated by a manifold in the input space, i.e. a continuous subset of the input space. In the case of our articulatory model with seven control parameters, a five-dimensional manifold can be associated with any given point in the two-dimensional F1/F2 acoustic space. This property, well known and used in robotics, was pinpointed for speech by Atal et al. (1978). Thus, it is possible to explore the potential compensation strategies that are related to a given acoustic product. Figure 6 shows a simplified representation of a manifold in a 3D control space associated with a single point in the F1/F2 space.
Figure 6. Example of the acoustic Maximal Vowel Space generated by exploring the whole articulatory parameter (P1–P2–P3) space corresponding to a hypercube. The acoustic output is the well-known F1/F2 triangle. The manifold in the input space associated with a single point in the acoustic output space is depicted as a surface in this example.
3.2 Prototypes for /i a u/ in the MVS
The MVS then allows a precise description of all the possibilities for maximal distinctiveness and permits an optimal choice of prototypical realizations. If one considers, following Liljencrants & Lindblom (1972), that the vowels /i a u/ are located within that space in such a way as to maximize the distances between vowels, it is possible to characterize the three point vowels in the F1-F2-F3 acoustic space in the following way (Schwartz et al., 1997b):
– The vowel /i/ is characterized by a maximal F3, resulting in a minimal F1 and a maximal F2;
– The vowel /a/ corresponds to a maximal F1;
– The vowel /u/ is produced with a minimal F2, resulting in a minimal F1 value.
In a systematic study, Ménard (2002) generated Maximal Vowel Spaces and selected vocalic prototypes for various ages using VLAM. The resulting formant values were found to be consistent with the data available in the literature (Peterson & Barney, 1952; Fant, 1973; Eguchi & Hirsh, 1969; Hillenbrand et al., 1995; Lee et al., 1999; Huber et al., 1999).
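Given a cloud of MVS sample points with their first three formants, these criteria reduce to simple extremal selections. A minimal sketch (the array names are hypothetical):

```python
import numpy as np

def select_prototypes(F1, F2, F3):
    """Indices of /i a u/ prototypes among sampled MVS points (formant
    arrays in Hz), following Schwartz et al. (1997b): /i/ has maximal F3,
    /a/ maximal F1, /u/ minimal F2."""
    return {"i": int(np.argmax(F3)),
            "a": int(np.argmax(F1)),
            "u": int(np.argmin(F2))}
```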
4. Influence of the Larynx Height Index on the Maximal Vowel Space
The main question motivating this study was to establish the influence of the Larynx Height Index on the Maximal Vowel Space. Therefore, articulatory and acoustic vocal tract simulations were carried out to generate the MVS for four different values of LHI (from 1.00 to 0.60). In order to overcome the acoustic normalization problem related to the resulting length variation, the vocal tract length was maintained constant (17.4 cm for the neutral articulation) by choosing the appropriate kmouth and kpharynx factors for each value of LHI (Figure 7).
We determined the input space by generating 20,000 vocalic articulations using a random uniform distribution for each of the seven articulatory parameters. We imposed a minimum threshold of 0.3 cm² for the constriction area, which is the standard value (cf. e.g. Fant, 1960; Catford, 1977). For the lips, this threshold was lowered to 0.1 cm², a value commonly observed during speech production, especially for closed vowels such as /u/ and /y/ in French (Abry & Boë, 1986). The minimum cross-sectional area requirement therefore constitutes the necessary condition for a specified vocal-tract configuration to be considered as that of a vowel.
Figure 7. Neutral position of the vocal tract for four values of LHI (constant length of 17.4 cm).
Formants were finally computed from the generated area functions using a frequency domain acoustic simulation of the vocal tract (Badin & Fant, 1984). The resulting MVSs are displayed in Figure 8 for the four LHI values. These results clearly show that, no matter what the ratio between the oral and pharyngeal cavities is, the vowel spaces are very similar. This can be explained by the possibility of compensation, as will be shown.
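The sampling procedure just described can be summarized as a Monte Carlo loop. In the sketch below, only the sampling logic and the published settings (uniform draws over seven parameters, 0.3 cm² constriction minimum, 0.1 cm² at the lips) come from the text; the articulatory-to-area-function mapping is a crude cosine-basis stand-in for VLAM, the sample count is reduced from 20,000 to keep the demo quick, and `formants` is the chain-matrix routine sketched after Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SECTIONS, TRACT_LEN_CM = 20, 17.4

def area_function(params):
    """Stand-in for VLAM (hypothetical): map seven control parameters in
    [-1, 1] to a smooth positive area function from glottis to lips via a
    log-cosine basis. VLAM's real components are anatomically derived."""
    x = np.linspace(0.0, 1.0, N_SECTIONS)
    log_area = np.log(4.0) + sum(
        p * np.cos(np.pi * (n + 1) * x) for n, p in enumerate(params))
    lengths = np.full(N_SECTIONS, TRACT_LEN_CM / N_SECTIONS)
    return list(zip(np.exp(log_area), lengths))

mvs = []
for _ in range(2_000):  # the study used 20,000 draws
    sections = area_function(rng.uniform(-1.0, 1.0, size=7))
    areas = np.array([a for a, _ in sections])
    # Vowel condition: at least 0.3 cm2 everywhere inside the tract,
    # relaxed to 0.1 cm2 at the lip end.
    if areas[:-1].min() < 0.3 or areas[-1] < 0.1:
        continue
    # `formants`: the chain-matrix routine from the earlier sketch.
    f = formants(sections, fmax=3000.0, df=10.0)  # coarse scan for speed
    if len(f) >= 2:
        mvs.append((f[0], f[1]))  # one (F1, F2) point of the MVS

print(len(mvs), "vowel-like articulations retained")
```

Plotting the retained (F1, F2) pairs traces out the triangular MVS; repeating the loop for the four LHI-specific geometries would give the four panels of Figure 8.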
5. Proposals for /i/, /a/, and /u/ prototypes for different Larynx Height Indexes
Now we are able to propose /i/, /a/, and /u/ prototypes for the two extreme LHI values (1.00 and 0.60). We therefore selected the representative point vowels in each MVS according to the criteria defined above, and determined the associated articulatory control parameters, and thus the corresponding midsagittal sections and area functions (Figures 9 & 10).
Figure 8. Maximal vowel spaces (F1/F2, in Hz) for four values of LHI (1.00, 0.80, 0.70, 0.60).
What differences can be observed between the articulatory strategies used to produce the acoustic outputs with the two extreme LHI values? For /i/, achieving approximately the same acoustic goals with a small pharyngeal cavity requires (i) a narrower constriction and (ii) a fronting of the tongue body. For /u/, a small pharyngeal cavity requires (i) a more anterior tongue body and (ii) increased lip protrusion. This is in agreement with Fant (1975), who reports, from cineradiographic data, that a female subject, compared to males, produces a more fronted /u/. Finally, /a/ is produced with much the same strategy in the two cases.
Figure 9. Midsagittal contours of the /i/, /a/, and /u/ prototypes, for LHI = 1.00 and 0.60.

Figure 10. Area functions of the /i/, /a/, and /u/ prototypes, for LHI = 1.00 and 0.60.
Figure 11. The two /i/, /a/, and /u/ triangles corresponding to the extreme values of LHI (○: 1.00 and •: 0.60), together with the /i/, /æ/, /ɑ/, and /u/ quadrilateral for adult males from Peterson & Barney (1952).
In order to verify the consistency of the selected /i/, /a/, and /u/ prototypes, we have plotted them in Figure 11, together with the Peterson & Barney (1952) data. We observe that they match adult male data rather well, as expected.
6. Discussion and conclusion
We have shown that modeling the growth of the vocal tract enables a better understanding of the phenomena governing anatomical differences between neonates, babies, adolescents, and male and female adults, as quantified by the Larynx Height Index. This approach allows discussion of the consequences of variation in vocal tract dimensions during evolution with the aim of establishing distinctive sounds for speech.
Our simulations show that the Maximal Vowel Space of a given vocal tract does not actually depend on the Larynx Height Index: gestures of the tongue body, lips and jaw allow compensations for differences in the ratio between the dimensions of the oral cavity and pharynx. This is fully consistent with the existing data
that have been collected: as far as we know, nobody to date has claimed that adolescents and women — who have shorter pharynges than men — have more difficulty than men in realizing vowel contrasts. These results confirm Goldstein’s (1980, p. 214) conclusions for newborns: “Since the vowels /i/, /ɑ/, and /u/ can be synthesized by a model which is anatomically correct for infants, one can postulate that newborns are not prevented from speaking because of the anatomy of their vocal tracts”. Recently, Meltzoff (2000) presented evidence for the antero-posterior mobility of the pharynx of the newborn: a few hours after birth, newborns can perform tongue protrusion in imitation of an adult, thus moving their tongue roots forward. These findings were not available to L&C in 1971, when they estimated very small cross-sectional areas of the pharyngeal cavity for /i/ and /u/ in the newborn. As for chimpanzee tongue root mobility, Takemoto (2000) performed comparative microscopic and macroscopic analyses of the human and chimpanzee tongue musculature. His dissections revealed that the human and chimpanzee tongues share a topologically identical musculature. If this result is confirmed, the inability of chimpanzees to produce speech would have to be ascribed to limitations at higher neural levels.
If Neandertals could not produce maximal vowel contrasts, it is unlikely that this was for the articulatory-acoustic reasons advocated by L&C. However, since our study is strictly limited to the morphological and acoustic aspects of the vocal tract, we cannot offer any definitive answer to the question of whether Neandertals could produce human speech or not. Finally, a low larynx and a large pharynx cannot be considered to be the “anatomical prerequisites for producing the full range of human speech”, nor to constitute a necessary evolutionary pre-adaptation to speech.
Acknowledgements
This research was funded by the CNRS program Origine de l’Homme, du Langage et des Langues, managed by Jean-Marie Hombert; it is ongoing research within the OMLL program (Origin of Man, Language and Languages) funded by ESF-EUROCORES, in the project Orofacial control in communication in human and non-human primates (managed by Jean-Luc Schwartz).
References
Abry, C. & Boë, L.-J. (1986). “Laws” for lips. Speech Communication, 5, 97–104.
Atal, B. S., Chang, J. J., Mathews, M. V. & Tukey, J. W. (1978). Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer sorting technique. Journal of the Acoustical Society of America, 63, 1535–1555.
Badin, P. & Fant, G. (1984). Notes on vocal tract computations. STL-QPSR, 2–3, 53–108.
Boë, L.-J. (1999). Modeling the growth of the vocal tract vowel spaces of newly-born infants and adults: Consequences for ontogenesis and phylogenesis. In Proceedings of the 14th International Congress of Phonetic Sciences, 3, 2501–2504.
Boë, L.-J., Heim, J.-L., Honda, K. & Maeda, S. (2002). The potential of Neandertal vowel space was as large as that of modern humans. Journal of Phonetics, 30, 465–484.
Boë, L.-J. & Maeda, S. (1998). Modélisation de la croissance du conduit vocal. Journées d’Études Linguistiques, La Voyelle dans tous ses états, Nantes, 98–105.
Boë, L.-J., Perrier, P., Guerin, B. & Schwartz, J.-L. (1989). Maximal vowel space. In Proceedings of Eurospeech, 2, 281–284.
Bonder, L. J. (1982). The n-tube formula and some of its consequences. Acustica, 52, 216–226.
Boule, M. (1911–1913). L’Homme fossile de La Chapelle-aux-Saints. Annales de Paléontologie Humaine. Paris: Masson.
Boule, M. (1921). Les hommes fossiles. Éléments de paléontologie humaine, 6, 111–172; 7, 21–56, 85–172; 8, 1–70. Paris: Masson.
Burr, D. (1976). Further evidence concerning speech in Neandertal man. Journal of the Royal Anthropological Institute (MAN), 11(1), 104–110.
Carlisle, R. C. & Siegel, M. I. (1974). Some problems in the interpretation of Neanderthal capabilities: A reply to Lieberman and Crelin. American Anthropologist, 76, 319–322.
Catford, J. C. (1977). Fundamental problems in phonetics. Edinburgh: Edinburgh University Press.
Chaline, J. (1994). Une famille peu ordinaire. Du singe à l’homme. Paris: Seuil.
Crothers, J. (1978). Typology and universals of vowel systems. In J. H. Greenberg (Ed.), Universals of Human Language (pp. 93–152). Stanford: Stanford University Press.
Eguchi, S. & Hirsh, I. J. (1969). Development of speech sounds in children. Acta Oto-Laryngologica, Suppl. 257, 5–51.
Falk, D. (1975). Comparative anatomy of the larynx in man and the chimpanzee: Implications for language in Neanderthal. American Journal of Physical Anthropology, 43(1), 123–132.
Fant, G. (1960). Acoustic theory of speech production. The Hague: Mouton.
Fant, G. (1975). Non-uniform vowel normalization. STL-QPSR, 2–3, 1–9.
Fitch, W. T. & Giedd, J. (1999). Morphology and development of the human vocal tract: A study using magnetic resonance imaging. Journal of the Acoustical Society of America, 106(3), 1511–1522.
Goldstein, U. G. (1980). An articulatory model for the vocal tract of growing children. Doctor of Science thesis, MIT, Cambridge, Massachusetts. http://theses.mit.edu/.
Heim, J.-L. (1976). Les Hommes fossiles de La Ferrassie. Tome I. Le gisement, les squelettes adultes : crâne et squelette du tronc. Archives de l’Institut de Paléontologie Humaine, 35. Paris: Masson.
Heim, J.-L. (1986). La reconstitution du crâne de La Chapelle-aux-Saints. Film 16 mm. Paris: Éditions SFRS.
Heim, J.-L. (1989). Une nouvelle reconstitution du crâne néandertalien de La Chapelle-aux-Saints. Comptes Rendus de l’Académie des Sciences de Paris, 308, II(6), 1187–1192.
Heim, J.-L. (1990). La nouvelle reconstitution du crâne néandertalien de La Chapelle-aux-Saints. Méthode et résultats. Bulletin et Mémoires de la Société d’Anthropologie de Paris, 6(1–2), 94–117.
Hillenbrand, J., Getty, L. A., Clark, M. J. & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97, 3099–3111.
Honda, K. & Tiede, M. K. (1998). An MRI study on the relationship between oral cavity shape and larynx position. In Proceedings of the 5th International Conference on Spoken Language Processing, 2, 437–440.
Houghton, P. (1993). Neandertal supralaryngeal vocal tract. American Journal of Physical Anthropology, 90, 139–146.
Huber, J. E., Stathopoulos, E. T., Curione, G. M., Ash, T. A. & Johnson, K. (1999). Formants of children, women, and men: The effects of vocal intensity variation. Journal of the Acoustical Society of America, 106, 1532–1542.
Kasuya, H., Suzuki, H. & Kido, K. (1968). Changes in pitch and first three formants of five Japanese vowels with age and sex of speakers. Journal of the Acoustical Society of Japan (in Japanese), 24, 355–364.
Kuhl, P. & Meltzoff, A. N. (1996). Infant vocalizations in response to speech: Vocal imitation and developmental change. Journal of the Acoustical Society of America, 100, 2425–2438.
Laitman, J. T. (1983). The evolution of the hominid upper respiratory system and implications for the origin of speech. In Glossogenetics (pp. 63–90). Harwood Academic Publishers.
Laitman, J. T., Heimbuch, R. C. & Crelin, E. S. (1979). The basicranium of fossil hominids as an indicator of their upper respiratory systems. American Journal of Physical Anthropology, 51, 15–34.
Leakey, R. (1994). The origin of humankind. Orion Publishing Group Ltd.
Lee, S., Potamianos, A. & Narayanan, S. (1999). Acoustics of children’s speech: Developmental changes of temporal and spectral parameters. Journal of the Acoustical Society of America, 105, 1455–1468.
LeMay, M. (1975). The language capability of Neanderthal man. American Journal of Physical Anthropology, 42, 9–14.
Lewin, R. (1984). Human evolution. Oxford: Blackwell Scientific Publications.
Lieberman, D. E. & McCarthy, R. C. (1999). The ontogeny of cranial base angulation in humans and chimpanzees and its implications for reconstructing pharyngeal dimensions. Journal of Human Evolution, 36(5), 487–517.
Lieberman, Ph. (1972). The speech of primates. The Hague: Mouton.
Lieberman, Ph. (1973). On the evolution of language: A unified view. Cognition, 2, 59–94.
Lieberman, Ph. (1984). The biology and evolution of language. Cambridge, MA: MIT Press.
Lieberman, Ph. (1991). Uniquely human: The evolution of speech, thought, and selfless behaviour. Cambridge, MA: Harvard University Press.
Lieberman, Ph. (1994). Hyoid bone position and speech: A reply to Dr. Arensburg et al. (1990). American Journal of Physical Anthropology, 94, 275–278.
Lieberman, Ph. & Crelin, E. S. (1971). On the speech of Neanderthal man. Linguistic Inquiry, 2, 203–222.
Lieberman, Ph., Crelin, E. S. & Klatt, D. (1972). Phonetic ability and related anatomy of the newborn and adult man, Neanderthal man and the chimpanzee. American Anthropologist, 74(2), 287–307.
Liljencrants, J. & Lindblom, B. (1972). Numerical simulations of vowel quality systems: The role of perceptual contrasts. Language, 48, 839–862.
Lumley, H. de (1998). L’Homme premier. Préhistoire, évolution, culture. Paris: Odile Jacob.
Maddieson, I. (1984/1986). Patterns of sounds (2nd edition). Cambridge: Cambridge University Press.
Maddieson, I. (1991). Testing the universality of phonological generalizations with a phonetically specified segment database: Results and limitations. Phonetica, 48, 193–206.
Maeda, S. (1989). Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model. In W. J. Hardcastle & A. Marchal (Eds.), Speech Production and Modelling (pp. 131–149). Dordrecht: Kluwer.
Mann, A. & Trinkaus, E. (1973). Neandertal and Neandertal-like fossils from the Upper Pleistocene. Yearbook of Physical Anthropology, 17, 169–193.
McCarthy, R. C. & Lieberman, D. E. (1997). Reconstructing vocal tract dimensions from cranial base flexion: An ontogenetic comparison of cranial base angulation in humans and chimpanzees. American Journal of Physical Anthropology, Suppl. 24, 163–164.
Meltzoff, A. N. (2000). Newborn imitation. In D. Muir & A. Slater (Eds.), Infant development: The essential readings (pp. 165–181). Oxford: Blackwell.
Ménard, L. (2002). Production et perception des voyelles au cours de la croissance du conduit vocal : variabilité, invariance et normalisation. Doctorat Sciences du Langage, Université de Grenoble III.
Ménard, L. & Boë, L.-J. (2000). Exploring vowel production strategies from infant to adult by means of articulatory inversion of formant data. In Proceedings of the International Congress of Spoken Language Processing, Beijing (China), 465–468.
Ménard, L., Schwartz, J.-L., Boë, L.-J., Kandel, S. & Vallée, N. (2002). Auditory normalization of French vowels synthesized by an articulatory model simulating growth from birth to adulthood. Journal of the Acoustical Society of America, 111(4), 1892–1905.
Morris, D. H. (1974). Neanderthal speech. Linguistic Inquiry, 5, 144–150.
Peterson, G. E. & Barney, H. L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24, 175–184.
Reichholf, J. H. (1990). Das Rätsel der Menschwerdung. Die Entstehung des Menschen im Wechselspiel mit der Natur. München: Deutscher Taschenbuch Verlag.
Schepartz, L. A. (1993). Language and modern human origins. Yearbook of Physical Anthropology, 36, 91–126.
Schwartz, J.-L., Boë, L.-J., Vallée, N. & Abry, C. (1997a). Major trends in vowel system inventories. Journal of Phonetics, 25, 233–253.
Schwartz, J.-L., Boë, L.-J., Vallée, N. & Abry, C. (1997b). The Dispersion-Focalization Theory of vowel systems. Journal of Phonetics, 25, 255–286.
Segui, J. & Ferrand, L. (2000). Leçons de parole. Paris: Odile Jacob.
Serkhane, J., Schwartz, J.-L., Boë, L.-J., Davis, B. & Matyear, C. (2002). Motor specifications of a baby robot via the analysis of infants’ vocalizations. In Proceedings of ICSLP, Denver.
Shreeve, J. (1995). The Neandertal enigma. London: Penguin Books.
Story, B. H., Titze, I. R. & Hoffman, E. A. (1996). Vocal tract area functions from magnetic resonance imaging. Journal of the Acoustical Society of America, 100, 537–554.
Takemoto, H. (2000). Morphological analysis and 3D modeling of the tongue musculature in the human and chimpanzee. In 5th Seminar on Speech Production: Models and Data (p. 361). Kloster Seeon, Bavaria.
Trinkaus, E. & Shipman, P. (1993). The Neandertals: Changing the image of mankind. New York: Alfred A. Knopf.
Vallée, N. (1994). Systèmes vocaliques : de la typologie aux prédictions. Doctoral thesis, Sciences du Langage, Université Stendhal, Grenoble, France.
Vauclair, J. (1992). L’intelligence de l’animal. Paris: Seuil.
White, P. (1998). Formant frequency analysis of children’s spoken and sung vowels using sweeping fundamental frequency production. Speech, Music and Hearing, Quarterly Progress and Status Report, TMH-QPSR, 1–2, 43–52.
Yang, C.-S. & Kasuya, H. (1994). Accurate measurement of vocal tract shapes from magnetic resonance images of child, female and male subjects. In Proceedings of the International Congress of Spoken Language Processing, Yokohama (Japan), 623–626.
About the first author
Louis-Jean Boë is an engineer-researcher with a Ph.D. in electronics, and is also a phonetician. He is a senior researcher at the Institute of Speech Communication (ICP) at the University of Grenoble, and an associated researcher at the Muséum National d’Histoire Naturelle in Paris. His main interests lie in language sound structures in relation to ontogenesis and phylogenesis, physical anthropology, the history of speech sciences, and deontological problems of forensic applications of phonetics. He is a member of the Société de Linguistique de Paris, the Société de Biométrie Française, and the Association Francophone de la Communication Parlée. Now at GIPSA-LAB.
Interweaving protosign and protospeech
Further developments beyond the mirror
Michael A. Arbib
Computer Science, Neuroscience and USC Brain Project, University of Southern California
We distinguish “language readiness” (biological) from “having language” (cultural) and outline a hypothesis for the evolution of the language-ready brain and language involving seven stages: S1: grasping; S2: a mirror system for grasping; S3: a simple imitation system for grasping, shared with the common ancestor of human and chimpanzee; S4: a complex imitation system for grasping; S5: protosign, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire for communication; S6: protospeech, the open-ended production and perception of sequences of vocal gestures, without these sequences constituting a full language; and S7: a process of cultural evolution in Homo sapiens yielding full human languages. The present paper will examine the subhypothesis that protosign (S5) formed a scaffolding for protospeech (S6), but that the two interacted with each other in supporting the evolution of brain and body that made Homo sapiens “language-ready”.
Keywords: language evolution, mirror system, protosign, protospeech
Language: What is to be explained?
Much work in syntax (e.g., Chomsky’s Minimalism) seeks to characterize the set of well-formed sentences of a given language within a more general framework of seeking a Universal Grammar which classifies just which grammars qualify a set of strings of words as a human language. However, my concern is with the evolution of those brain mechanisms which support language, and thus I will focus on a performance viewpoint which stresses the dynamics of perception and production. Moreover, I see both speech and signing as expressions of language, but will not consider here the specialized skills associated with reading and writing. Although I will not do full justice to them, my framework will be provided by three levels (Figure 1):
1. Cognitive Structures (Schema Assemblages)
2. Semantic Structures (Hierarchical Constituents expressing objects, actions and relationships)
3. “Phonological” Structures (Ordered Expressive Gestures)
with production proceeding 1→2→3 and perception proceeding 3→2→1. Note that (3) has been formulated so as to accommodate the “phonology” of both spoken and signed languages.
I will argue that a key transition in “brain power” was the ability to recognize that a novel action was in fact composed of (approximations to) known actions. This is crucial not only to the child’s capacity for what I call “complex imitation” and the ability to acquire language and social skills, but is also essential to the adult use of language. In both speech and signing, we recognize a novel utterance as in fact composed of (approximations to) known actions, namely the uttering of words. Just as crucially, the stock of words is open-ended.
In Figure 1, semantic form acts as an intermediary between the language-specific phonological form and the “almost” language-independent cognitive form. In the work sampled in this paper, I have not treated the cognitive form for abstract expressions but have rather thought through some of the issues of linking language to the here-and-now of a particular situation, as in describing the present scene, or answering questions about it.
Figure 1. A view of language production and perception as the linkage of Cognitive Form (CF), Semantic Form (SF), and Phonological Form (PF), where the phonology may involve vocal or manual gestures, or just one of these, with or without the accompaniment of facial gestures.
What would it mean to explain the evolution of human language?
It has often been observed that the human archeological record shows little trace of art, burial rituals and other “modern” developments of human culture (as distinct from, say, hunting practices, tool making, and the use of fire) before about 50,000 years ago, and some have argued that this apparent transition may have occurred with the “discovery” of language. On this view, the brain of early Homo sapiens was “language ready” in the sense that once history had led to the invention of the appropriate technology through the accumulation of many acts of genius (cf. Tomasello, 1999), human brains had no problem acquiring the skills needed to master that technology. On this basis, I phrase the pertinent questions for the study of the evolution of human language as follows:
The biological question (“biological evolution”): How did humans evolve a “language-ready” brain (and body)? This raises the ancillary questions: What is it about the human brain and body that allows humans to acquire languages of a richness that seems to be denied to all other animals? How did it get to be that way? The attributes of the human genome that allow human children to readily acquire language may not be the result of evolutionary pressures directly related to language. For example, today’s children can easily acquire the skills of Web surfing and videogame playing, but there are no genes specific to these skills.
The historical question (“cultural evolution”): What historical developments led to the wide variety of languages seen today? If one accepts, as I do, the view that the earliest Homo sapiens did not have language in the modern sense, then one must include in this task an analysis of how language may have emerged from some earlier form of communication. I will use the term protolanguage for such earlier forms, leaving it till later in the article to discuss contrasting views of what the structure of protolanguage might have been.
The biological and historical questions come together in:
The developmental question: What is the social structure that brings the child into the community using a specific language, and what neural capabilities are needed in the brains of the child and its caregivers to support this process?
My focus in the present article will be primarily on the biological question, mediating between two contrasting views of the role of manual communication in the evolution of language-readiness:
(H1) Language evolved directly as speech (MacNeilage, 1998; MacNeilage and Davis, 2004a);
(H2) Language evolved first as sign language (i.e., a full language, not a protolanguage) and then speech emerged from this basis in manual communication (Stokoe, 2001).
I shall instead introduce the notion of “protosign” (protolanguage communicated primarily by manual and facial gestures) and “protospeech” (protolanguage communicated primarily by vocal gestures) and then argue for: (H3) The Doctrine of the Expanding Spiral: Our distant ancestors (e.g., Homo habilis through to early Homo sapiens) had “protosign” which — contra (H1) — provided essential scaffolding for the emergence of “protospeech”, but biological and cultural evolution along the hominid line saw advances in both protosign and protospeech feeding off each other in an expanding spiral so that — contra (H2) — protosign did not attain the status of a full language prior to the emergence of early forms of protospeech. My argument will be based on the Mirror System Hypothesis of Arbib and Rizzolatti (1997; Rizzolatti & Arbib, 1998). It should be noted that our original papers could be read as arguing for view (H2) above, that “language evolved first as sign language and then speech emerged from this basis in manual communication”, but the version presented here (continuing developments reported in Arbib, 2002; see also Arbib, 2005) will be tuned to view (H3), the expanding spiral of protosign and protospeech.1 Before proceeding to an exposition of the Mirror System Hypothesis, let me first clarify the notion of protolanguage. As the term is used here, a protolanguage is a system of utterances used by a particular prehominid species (possibly including early Homo sapiens) which we may recognize as a precursor to human language in the modern sense. A particular view of the structure of protolanguage has been advocated by Bickerton (1995) who defines a protolanguage to be made up of utterances comprising a few words in the current sense without syntactic structure. For Bickerton, infant language, pidgins, and the “language” taught to apes are all protolanguages in his sense. Bickerton’s hypothesis on language evolution was, then, that the protolanguage of Homo erectus was a protolanguage in his sense. Language just “added syntax” through the evolution of Universal Grammar. My counter-claim is that the first Homo sapiens had a “language-ready brain” but did not have language, and that their protolanguage was not a protolanguage in Bickerton’s sense. Indeed, I have argued that the protolanguage of Homo erectus and early Homo sapiens was composed mainly of “unitary utterances” (see the argument, commentaries pro and con, and my response in Arbib, 2005). By this I mean that a single “protoword” would be used to refer to a familiar event. Such a “protoword” or “unitary utterance” might need several words for its translation into English, yet would have no parts that would correspond to words in this sense. On this view (cf. Wray, 1998), words co-evolved culturally with syntax through fractionation — i.e., as protowords came to be replaced by combinations
of semantically smaller units, so too did rules emerge to clarify the structure and hence the meaning of such combinations. However, this hypothesis is of limited relevance to our central concern with the putative interaction of protosign and protospeech and so will not be developed further in this article. My only point here is to insist that Bickerton’s influential view of protolanguage is not necessarily correct (though its possible correctness has not been ruled out).
The mirror system approach to the evolution of human language

The companion article by Fogassi and Ferrari (2004) reviews basic data on the mirror system in area F5 of macaque premotor cortex which links the observation and execution of manual actions.2 Rizzolatti & Arbib (1998; see further Arbib & Bota, 2003) review evidence that monkey F5 (with its mirror system for grasping) is homologous to human Broca’s area. This grounds:

The Mirror System Hypothesis: The parity requirement for language in humans — that what counts for the speaker must count approximately the same for the hearer3 — is met because Broca’s area evolved atop the mirror system for grasping with its capacity to generate and recognize a set of actions.
Much work in aphasiology (see, e.g., Benson & Ardila, 1996) suggests that the localization of the production of word sounds is probably in Broca’s area, while the localization of the recognition of word sounds is, more likely, in “Wernicke’s” area — suggesting that parity of linguistic forms between production and perception does not require identity of the neural areas supporting production and perception of those forms. However (see Arbib [2004] in response to Hurford [2004]), one must distinguish the mirror system for the sign (phonological form) from the linkage of the sign to the neural schema for the signified. One should also note that, although the original formulation of the Mirror System Hypothesis is Broca-centric, Arbib and Bota (2003) do stress that interactions between parietal (PF), temporal (STS) and premotor (F5) areas in the monkey provide an evolutionary basis for the integration of Wernicke’s area, STS and Broca’s area in the human. On this view, Broca’s area becomes the meeting place for phonological perception and production, but other areas are required to link phonological form (in the general sense explained after Figure 1) to semantic form. However, whether in its original form, or expanded as in Arbib and Bota (2003), the Mirror System Hypothesis provides a neural basis for the claim that hand movements grounded the evolution of language. Arbib (2002) modified and developed the Rizzolatti-Arbib argument to hypothesize seven stages in the evolution of language, with imitation grounding two of the stages. The first three stages are pre-hominid:
S1: Grasping.
S2: A mirror system for grasping shared with the common ancestor of human and monkey.
S3: A simple imitation system for grasping shared with the common ancestor of human and chimpanzee.

The next three stages then distinguish the hominid line from that of the great apes:

S4: A complex imitation system for grasping.
S5: Protosign, a manual-based communication system, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire.
S6: Protospeech, resulting from the ability of control mechanisms evolved for protosign coming to control the vocal apparatus with increasing flexibility.

I hypothesize that protosign built up vocabulary by variations on moving handshapes along specific trajectories to meaningful locations; whereas protospeech “went particulate”, with the emergence of a relatively limited stock of syllables or phonemes — but not to the exclusion of a “phonology” that combined protosign and protospeech within a rich manual-facial-vocal domain. In my view, these six stages do not (in general) replace capabilities of the ancestral brain so much as they enrich those capabilities by embedding them in a more powerful system. The final stage is:

S7: Language: the change from action-object frames to verb-argument structures to syntax and semantics; the historical (rather than biological) co-evolution of cognitive and linguistic complexity.

This stage is claimed to involve little if any biological evolution, but instead to result from cultural evolution (historical change) in Homo sapiens. However, for the present article the crucial assertion is that Stage S5, protosign, made essential contributions to Stage S6, the emergence of protospeech, and that it is mistaken to view speech as evolving without the scaffolding of protosign.

As far as the development of grasping itself is concerned, ILGM (the Infant Learning to Grasp Model; Oztop, Bradley, & Arbib, 2004) models how the human or macaque infant develops a set of grasps which succeed by kinesthetic, somatosensory criteria, while the MNS model (Mirror Neuron System model; Oztop & Arbib, 2002) demonstrates how the (grasp) mirror neuron system develops, driven by visual stimuli relating hand and object that are generated by the actions (grasps) performed by the infant itself. The human infant (with maturation of visual acuity) gains the ability to map other individuals’ actions into its internal motor representation,
and then the infant acquires the ability to create (internal) representations for novel actions observed and develops an action prediction capability (though we have not yet modeled these latter stages). These models support the claim for the essential plasticity of the mirror system — that it does not come prewired with a specific set of grasps, but rather expands its repertoire to adaptively encode the experience of the agent. This is an important aspect of the emergence of increasing manual dexterity. It becomes crucial when we argue for the essential role of a mirror system for phonological articulation at the heart of the brain’s readiness for language. Fogassi and Ferrari (2004) provide further discussion of “mirror neurons, gestures and language evolution”. In particular, they review data supporting the claims that (i) vocalizations of nonhuman primates did not serve as the precursors of human speech, while (ii) nonhuman primate gestures might reflect not only emotional states but also the cognitive processes underlying their production.
MacNeilage’s frame/content theory: A critique

Each spoken language has a relatively small fixed set of phonological units which have no independent meaning but can be combined and organized in the construction of word forms. These units vary from language to language, but in each spoken language they involve the choreographed activity of the vocal articulators, the lips, tongue, vocal folds, velum (the port to the nasal passages), and respiration (see Byrd & Saltzman, 2003, for a review of speech production). But what are these units? Different authors have argued for features, gestures, phonemes (roughly, segments), moras, syllables, subsyllabic constituents (such as the syllable onset, nucleus, rime, and coda), gestural structures, or metrical feet (see Ladefoged, 2001). Here, I will focus on the work of MacNeilage and Davis (MacNeilage, 1998; MacNeilage & Davis, 2005) which emphasizes syllables as the basic unit of speech. They distinguish three levels of mammalian vocal production:

– Respiration: the basic cycle is the inspiration-expiration alternation, and the expiratory phase is modulated to produce vocalizations.
– Phonation: the basic cycle is the alternation of the vocal folds between an open and closed position (“voicing” in humans). This cycle is modulated by changes in vocal fold tension and subglottal pressure level, producing variations in pitch.
– Articulation: In their view, this is based on the syllable, defined in terms of a nucleus with a relatively open vocal tract, and margins with a relatively closed vocal tract. Modulation of this open-close cycle in humans takes the form of typically producing different phonemes, consonants (C) and vowels (V), respectively, in successive closing and opening phases.
MacNeilage (1998) offers the frame/content theory of evolution of speech production grounded in this view of the CV syllable as the basic unit of articulation, and he argues strongly that language evolved directly as speech, while denying a role for manual gesture in language evolution. However, in evaluating his theory, one must note that articulation is not language. Recalling Figure 1, we must distinguish:

– syllabic vocalization (a substructure for Phonological Form in the speech domain, not even all of spoken Phonological Form itself) from
– spoken language, which at the minimum must include Semantic Form and Phonological Form and their linkage.

Thus my critique of MacNeilage’s evolutionary theory has two parts: to raise some questions about his account of the evolution of syllabic vocalization (admittedly, this critique will be brief and somewhat amateurish) and (more centrally) to observe that his original theory gives no insight into the evolution of speech in the sense of spoken language as distinct from speech as the ability to articulate syllables. We shall see that the updated version of the theory (MacNeilage and Davis, 2005) goes part way, but only part way, to meeting this objection.

First, a brief rhetorical jeu d’esprit (certainly not a reasoned scientific criticism) concerning the claim that the basic syllable is CV. MacNeilage and Davis (2001) assert that: (i) “Words typically begin with a consonant and end with a vowel.” (ii) “The dominant syllable type within words is considered to be consonant–vowel (CV).” However, neither sentence supports its own claim! In (i), only 1 of the 11 words conforms to the claim it makes; while in (ii), only about 6 out of 22 syllables conform! Indeed, the CVC syllable is basic in English, whereas the CV mora is basic in Japanese, but each language allows variation from this norm. And note, too, the importance of tones in Chinese and clicks in certain African languages.

Jusczyk et al. (1999) found that 9-month-old English learners are sensitive to shared features that occur at the beginnings, but not at the ends, of syllables. Specifically, the infants had significant listening preferences for lists in which the items shared either initial CVs, initial Cs, or the same manner of articulation at syllable onsets. This suggests that infants may only later develop sensitivity to features that occur late in complex syllables. This certainly accords with MacNeilage and Davis’s (2001) argument for the developmental priority of the CV syllable, and it may be true that the CV form is the only universal syllable form in languages (Maddieson, 1999), but one must still explain (as MacNeilage [1998] does not) the evolution of the wide range of other “syllable level” units used across the world’s languages.
A more general concern is with the tendency to equate what is seen in the human infant with what must have developed in human evolution. At a trivial level, we know that hominids could make tools and hunt animals, and we know that modern human infants cannot, and so it is dangerous to equate “early in ontogeny” with “early in phylogeny”. Closer to the language issue, note the data on the bonobo Kanzi, who was taught to use lexigrams to represent various concepts and could arrange several lexigrams in novel combinations. Savage-Rumbaugh et al. (1998) report that Kanzi and a 2.5-year-old girl were tested on their comprehension of 660 sentences phrased as simple requests (presented once). Kanzi was able to carry out the request correctly 72% of the time, whereas the girl scored 66% on the same sentences and task. This seemed to mark the limits of Kanzi’s abilities but was just the beginning for the human child. This suggests that the brain mechanisms which support the full richness of human language may not be fully expressed in the first two years of life yet may, given the appropriate developmental grounding, eventually prove crucial in enabling the human child to acquire language.

But let us return to MacNeilage’s Frame/Content (F/C) theory. The central notion is that there is a CV syllable structure frame, into which “content” is inserted prior to output. He argues that the speech frame may have been exapted from the combination of the mandibular cycle (originally evolved for chewing, sucking, and licking) with laryngeal sounds. However, I would claim that the mandibular cycle is too far back in evolutionary time to serve as an interesting way station on the path toward syllabic vocalization. Moreover, I would argue that the varied nature of the syllable in English calls into question the notion of “modulated cyclicity” as distinct from a sequence of distinct and overlapping gestures. To illustrate this point, consider the evolution from rhythmic movements in our fish-like ancestors to the human capability for discrete goal-seeking movements of the hands. Among the stages we might discriminate are the following: swimming; visual control of trajectory; locomotion on land; adaptation to uneven terrain (so that two modes of locomotion emerge: modulation of rhythmic leg movements versus discrete steps, e.g., from rock to rock); brachiation; bipedalism; and finally dexterity encompassing, for example, grooming (note the jaw/hand tradeoff), tool use and gesture. It seems to me no more illuminating to root syllable production in mandibular oscillation than to root dexterity in swimming — the evolutionary path is there in each case, but the crucial questions concern the intermediate stages far more than the remote starting points.

Schaal et al. (2004) report results from human functional neuroimaging that show significantly different brain activity in discrete and rhythmic movements, although both movement conditions were confined to the same single wrist joint. Rhythmic movements merely activated unilateral primary motor areas, while discrete movements elicited additional activity in premotor, parietal, and cingulate
cortex, as well as significant bilateral activation. They suggest that rhythmic and discrete movement may be two basic movement primitives in human arm control which require separate neurophysiological and theoretical treatment. The relevance for the present discussion is to suggest that human speech is best viewed as an assemblage of discrete movements even if it has a rhythmic component, and thus may have required major cortical innovations in evolution for its support. An account of speech evolution needs to shift emphasis from ancient functions to the changes of the last 20 million years that set humans off from other primates — to the transition from the vocalizations that I assume our ancient ancestors shared with modern nonhuman primates to the use of syllabification or other units of vocal articulation for vocal communication. Indeed, MacNeilage and Davis (2005) appear to be moving in this direction, for they now note further stages en route to spoken language:

– evolution of the mouth close-open alternation (circa 200 million years ago);
– visuofacial communication in the form of lipsmacks, tonguesmacks, and teeth chatters (established with the monkey-human common ancestor perhaps 20 million years ago);
– pairing of the communicative close-open alternation with phonation to form protosyllabic “Frames”; and
– the frame becoming programmable with individual consonants and vowels.

I would also note that the human transition to an omnivorous diet may well have been accompanied by what I would call oral dexterity — an increasing set of skills in the coordination of lip, tongue and teeth movements in chewing prior to swallowing. These might be part of the evolutionary path from mandibular oscillations to a form of skilled motor control that has some possibility of evolving to provide articulators and neural control circuits suitable for voluntary vocal communication. The challenge is to fill in the above schematic to provide in detail the evolutionarily plausible stages that get us from chewing to deploying tongue, respiration, larynx, lips, etc., in the service of spoken language.

With this, it is time to note that, whatever its merits as a theory of the evolution of syllabic vocalization, MacNeilage (1998) has nothing to say about the evolution of spoken language in the sense which links Semantic Form and Phonological Form. I think it is only because MacNeilage ignores the distinction between the two senses of “evolution of speech” that he can argue as confidently as he does for the view that (H1) language evolved directly as speech, dismissing claims for the key role of manual gesture in the evolution of language (e.g., hypotheses H2 and H3 above). What is lacking in the 1998 paper is any account that links syllables to communication. Unless one can give some account of how strings of syllables
come to have meaning, it is hard to see what evolutionary pressure would have provided selective advantage for the transition from “oral dexterity” to skilled vocal articulation. MacNeilage and Davis (2005, 2006) have at last begun to address the question, giving two different answers: (a) that ingestive movements may form a basis for communication (I shall return to this in some detail below), and (b) that words for “mother” and “father” might have emerged by conventionalization of the infant’s first attempts at syllabification. My response is not to deny these claims, but only to stress that neither seems to establish a rich semantics in the way that pantomime does. With this, the time has come to argue for H3 by suggesting the crucial role of pantomime in the evolution of protosign, and then showing how this may have provided the scaffolding for protospeech and their consequent development in an expanding spiral. The argument for H3 will stand as an implicit critique of H2, the view that fully expressive signed languages developed prior to speech.
The doctrine of the expanding spiral

For want of better data, we will assume that our common human-monkey ancestors shared with monkeys the following:
– A Primate Call System: a limited set of species-specific calls; and
– An Oro-Facial Gesture System: a limited set of gestures expressive of emotion and related social indicators.

Note the linkage between the two systems: communication is inherently multimodal. However, combinatorial properties for the openness of communication are virtually absent in basic primate calls and oro-facial communication, though individual calls may be graded. Moreover, the neural mechanisms for these primate calls are basically in the midbrain; to the extent that there is cortical control of these calls in nonhuman primates, it is exercised by mesial cortex, an area completely distinct from F5/Broca’s area. Jürgens (2002) found that voluntary control over the initiation and suppression of vocal utterances, in contrast to completely innate vocal reactions such as pain shrieking, relies on the mediofrontal cortex including the anterior cingulate gyrus and the supplementary as well as pre-supplementary motor areas.4 Voluntary control over the subcortical motor pattern generators for vocalization is carried out by the motor cortex via pyramidal/corticobulbar as well as extrapyramidal pathways.5

For most humans, language is heavily intertwined with speech. The mirror system hypothesis offers a compelling explanation of why F5, rather than the mesial
cortex already involved in monkey vocalization, is homologous to Broca’s area, the substrate for language. MacNeilage (1998) offers an answer based on the role of premotor cortex in ingestion, an important point to which I will return below. However, as already noted, MacNeilage (1998) failed to offer an account of “how sounds come to mean”, and MacNeilage and Davis (2005, 2006) only take the story as far as ingestive functions and words like “mama” and “papa”. By contrast, the following provides at least the scaffolding for a much richer account. Here, we focus on three stages of the evolutionary sequence hypothesized by Arbib (2002):

S4: A complex imitation system for grasping.
S5: Protosign, a manual-based communication system, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire.
S6: Protospeech, resulting from the ability of control mechanisms evolved for protosign coming to control the vocal apparatus with increasing flexibility.
I emphasize again that the mirror system is not confined to F5, and that both F5 and the larger system of which it is part had to change during the course of hominid evolution. Monkeys do not imitate (see Arbib, 2005, for further discussion), and so we must infer that having a monkey-like mirror system for grasping is not sufficient for imitation. I thus hypothesize that getting beyond this limitation was one of the crucial evolutionary changes in the “extended” mirror system that in concert with others yielded a specifically human brain. As noted in the discussion of Figure 1, complex imitation — which exploits the ability to recognize that a novel action is in fact composed of (approximations to) known actions — is crucial both to the child’s ability to acquire language and to adult use of language. My hypothesis is that complex imitation for hand movements evolved because of its adaptive value in supporting the increased transfer of manual skills and thus preceded the emergence of protolanguage in whatever modality. Donald (1994, 1999) also argues that mimesis (which, I think, is similar to what I term complex imitation) was a general-purpose adaptation, so that a capacity for vocal-auditory mimesis would have emerged simultaneously with a capacity for manual mimesis. Indeed, it is clear that Stage S6 (which I see as intertwined with S5) does require vocal-auditory mimesis. However, my core argument does not rest on the debate over whether or not manual mimesis preceded vocal-auditory mimesis — but it does insist that complex imitation of hand movements was crucial to the development of an open system of communication.

The doctrine of the expanding spiral is the hypothesis that protosign exploits the ability for complex imitation of hand movements to adapt this imitation to the needs of communication, and that the resulting protosign provides scaffolding for protospeech but that both develop together thereafter6 — Stages S5 and
S6 are intertwined. The strong hypothesis here is that protosign is essential to this process, so that the full development of protospeech would be impossible without the protosign scaffolding. It does not rest on arguments over whether Stage S5 precedes S6, since it holds that these stages are intertwined. To develop the doctrine, we must distinguish two roles for imitation in the transition from S4 to S5:

1. The transition from praxic action directed towards a goal object to pantomime, in which similar actions are produced away from the goal object. But to pantomime the flight of a bird you must use movements of the hand (and arm and body) to imitate movement other than hand movements. You can pantomime an object either by miming a typical action by or with the object, or by movements which suggest tracing out the characteristic shape of the object. Imitation is the generic attempt to reproduce movements performed by another, whether to master a skill or simply as part of a social interaction. By contrast, pantomime is performed with the intention of getting the observer to think of a specific action, object or event. It is essentially communicative in its nature. The imitator observes; the pantomimic intends to be observed. The transition to pantomime does seem to involve a genuine neurological change. Mirror neurons for grasping in the monkey will fire only if the monkey sees both the hand movement and the object to which it is directed (Umiltà et al., 2001). A grasping movement that is not made in the presence (or with working memory) of a suitable object, or is not directed toward that object, will not elicit mirror neuron firing. By contrast, in pantomime, the observer sees the movement in isolation and infers (i) what non-hand movement is being mimicked by the hand movement, and (ii) the goal or object of the action. This is an evolutionary change of key relevance to language readiness. The very structure of these sequences can serve as the basis for immediate imitation or for the immediate construction of an appropriate response, as well as contributing to the longer-term enrichment of experience.

2. A further critical change en route to language emerges from the fact that in pantomime it might be hard to, for example, distinguish a movement signifying “bird” from one meaning “flying”. This inability to adequately convey shades of meaning using “natural” pantomime would favor the invention of gestures which could in some way combine with (e.g., sequentially, or by modulation of some kind) the original pantomime to disambiguate which of its associated meanings was intended. Note that whereas a pantomime can freely use any movement that might evoke the intended observation in the mind of the observer, a disambiguating gesture must be conventionalized, i.e., it can only be used within a community that has negotiated or learned how it is to be
interpreted. This use of non-pantomimic gestures requires extending the use of the mirror system to attend to a whole new class of hand movements, those with conventional meanings agreed upon by the protosign community to reduce ambiguity and extend semantic range. However, this does not seem to require a biological change beyond that limned in the previous paragraph.

As Stokoe (2001) and others emphasize, the power of pantomime is that it provides open-ended communication that works without prior instruction or convention. However, I would emphasize that pantomime per se is not a form of protolanguage; rather it provides a rich scaffolding for the emergence of protosign. In the present theory, the crucial ingredient for the emergence of symbolization is the extension of imitation from the imitation of hand movements to the ability to project the degrees of freedom of quite different movements onto hand movements which evoke something of the original in the brain of the observer. This involves not merely changes internal to the mirror system but its integration with a wide range of brain regions involved in the elaboration and linkage of perceptual and motor schemas. When pantomime is of praxic hand actions, that pantomime directly taps into the mirror system for these actions. However, as the pantomime begins to use hand movements to mime different degrees of freedom (as in miming the flying of a bird), a dissociation begins to emerge. The mirror system for the pantomime (based on movements of face, hand, etc.) is now different from the recognition system for the action that is pantomimed, and — as in the case of flying — the action may not even be in the human action repertoire. However, the system is still able to exploit the praxic recognition system because an animal or hominid must observe much about the environment that is relevant to its actions, but is not in its own action repertoire. Nonetheless, this dissociation now underwrites the emergence of actions which are defined only by their communicative impact, not by their praxic goals. Recall the earlier comment (Hurford, 2004) distinguishing the mirror system for the sign (phonological form) from the linkage of the sign to the neural schema for the signified.

Pantomime is not itself part of protosign but rather a scaffolding for creating it. Pantomime involves the production of a motoric representation through the transformation of a recalled exemplar of some activity. As such, it can vary from actor to actor, and from occasion to occasion. By contrast, the meaning of conventional gestures must be agreed upon by a community. Protosign may lose the ability of the original pantomime to elicit a response from someone who has not seen it before. However, the price is worth paying in that the simplified form, once agreed upon by the community, allows more rapid communication with less neural effort. In the same way, I suggest that pantomime
is a valuable crutch for acquiring a modern sign language, but that even signs which resemble pantomimes are conventionalized and are thus distinct from pantomimes.7

Interestingly, signers using American Sign Language (ASL) show a dissociation between the neural systems involved in sign language and those involved in conventionalized gesture and pantomime. For example, Corina et al. (1992) described patient WL with damage to left hemisphere perisylvian regions. WL exhibited poor sign language comprehension, and his sign production had phonological and semantic errors as well as reduced grammatical structure. Nonetheless, WL was able to produce stretches of pantomime and tended to substitute pantomimes for signs, even when the pantomime required more complex movement. Consistent with this, I would argue that the evolution of neural systems involving Broca’s area is adequate bilaterally to support conventionalized gestures when these are to be used in isolation. However, the predisposition for the skilled weaving of such gestures into complex wholes which represent novel complex meanings (novel in the sense of being created on-line rather than well rehearsed) has become lateralized in humans. In other words, conventionalized gestures that can stand alone can still be used when the adult left hemisphere is damaged, but those which form part of the expression of a larger meaning depend essentially on left hemisphere mechanisms.

I have separated S6, the evolution of protospeech, from S5, the evolution of protosign, to stress the point that the role of F5 in grounding the evolution of a protolanguage system would work just as well if we and all our ancestors had been deaf. However, primates do have a rich auditory system which contributes to species survival in many ways of which communication is just one (Ghazanfar, 2003). The hypothesis here, then, is not that the protolanguage system had to create the appropriate auditory and vocal-motor system “from scratch” but rather that it could build upon the existing mechanisms to derive protospeech. My hypothesis is that protosign grounded the crucial innovation of using arbitrary symbolic gestures to convey novel meanings, and that this in turn provided the scaffolding for protospeech. Consistent with my view that true language emerged during the history of Homo sapiens and the observation that the vocal apparatus of humans is especially well adapted for speech, I suggest that the interplay between protosign and protospeech was an expanding spiral which yielded a brain that was ready for language in the multiple modalities of gesture, vocalization, and facial expression. As we shall see below, monkeys already have oro-facial communicative gestures and these may certainly support a limited communicative role. However, their calls and gestures lack the ability of pantomime to convey a rich and varied repertoire of meanings without prior conventionalization. The set of species-specific vocalizations of a species of nonhuman primates is closed in the sense that it is restricted to a specific repertoire. This is to be contrasted
with human languages, which are open both as to the creation of new vocabulary (a facility I would see as already present in protosign and protospeech) and as to the ability to combine words and grammatical markers in diverse ways to yield an essentially unbounded stock of sentences (a facility I would see as peculiar to language as distinct from protolanguage).

Here it must be noted that the notion of mirror neuron is proving more elastic than might have appeared from the early studies. Kohler et al. (2002) found that 15% of mirror neurons in the hand area of F5 can respond to the distinctive sound of an action (breaking peanuts, ripping paper, etc.) as well as to the sight of the action. Ferrari et al. (2003) show that the oro-facial area of F5 (adjacent to the hand area) contains a small number of neurons tuned to communicative gestures (lipsmacking, etc.), but the observation and execution functions of these neurons are not strictly congruent — most of the neurons are active for execution of ingestive actions, e.g., one “observed” lip protrusion but “executed” syringe sucking (Ferrari et al., 2003). Some might argue that the Kohler et al. neurons let one apply the Mirror System Hypothesis to vocalization directly, “cutting out the middleman” of manual gesture. However, the sounds studied by Kohler et al. (2002) cannot be created in the absence of the object, and there is no evidence that monkeys can use their vocal apparatus to mimic the sounds they have heard. My preferred hypothesis, then, is that the limited number and congruence of these neurons are more consistent with the view that

– manual gesture is primary in the early stages of the evolution of language-readiness, but that
– audio-motor neurons lay the basis for later extension of protosign to protospeech, and that
– the protospeech neurons in the F5 precursor of Broca’s area may be rooted in ingestive behaviors.

The usual evolutionary caveat: Macaques are not ancestral to humans. What is being said here is shorthand for the following: (a) There are ingestion-related mirror neurons observed in the macaque. (b) I hypothesize that such neurons also existed in the common ancestor of human and macaque some 20 million years ago. (c) Noting with Fogassi and Ferrari (2004) that there is little evidence of voluntary control of vocal communication in nonhuman primates, I further hypothesize that evolution along the hominid line (after the split 5 million years ago between the ancestors of the great apes and those of humans) expanded upon this circuitry to create the circuitry for protospeech.
The notion, then, is that the manual domain supports the expression of meaning by sequences and interweavings of gestures, with a progression from “natural” to increasingly conventionalized gesture to speed and extend the range of communication within a community. I then argue that Stage S5 (protosign) provides the scaffolding for Stage S6 (protospeech). We have already seen that some mirror neurons in the monkey are responsive to auditory input. We now note that there are orofacial neurons in F5 that control movements that could well affect sounds emitted by the monkey. The speculation here is that the evolution of a system for voluntary control of intended communication based on F5/Broca’s area could then lay the basis for the evolution of creatures with more and more prominent connections from F5/Broca’s area to the vocal apparatus. This in turn could provide conditions that lead to a period of co-evolution of the vocal apparatus and the neural circuitry to control it. Leo Fogassi (personal communication) comments that “Although the primacy of manual gesture alone could still be a matter of discussion, the hypothesis we presented [i.e., Fogassi & Ferrari, 2004] is not very different from yours. What we have to show is whether audiomotor neurons are present in a similar percent also among mouth related neurons and whether in monkeys there is already, in ventral premotor cortex, a proto-vocal apparatus unrelated to emotional behavior.”
However, as noted in my earlier discussion of Donald on mimesis, my argument does not rest on the debate over whether or not a proto-manual communication system preceded the emergence of a proto-vocal apparatus. Rather, my essential claim is that complex imitation of hand movements was crucial to the development of an open system of communication. Complementing earlier studies on hand neurons in macaque F5, Ferrari et al. (2003; see also Fogassi & Ferrari, 2004) found that about one-third of mouth motor neurons in F5 also discharge when the monkey observes another individual performing mouth actions. The majority of these “mouth mirror neurons” become active during the execution and observation of mouth actions related to ingestive functions such as grasping with the mouth, sucking or breaking food. Another population of mouth mirror neurons also discharges during the execution of ingestive actions, but the most effective visual stimuli in triggering them are communicative mouth gestures (e.g., lip smacking). This fits with the hypothesis that neurons learn to associate patterns of neural firing rather than being committed to learn specifically pigeonholed categories of data. Thus a potential mirror neuron is in no way committed to become a mirror neuron in the strict sense, even though it may be more likely to do so than otherwise. The observed communicative actions (with the effective executed action for different “mirror neurons” in parentheses)
include lip-smacking (sucking, sucking and lip smacking); lips protrusion (grasping with lips, lips protrusion, lip smacking, grasping and chewing); tongue protrusion (reaching with tongue); teeth-chatter (grasping); and lips/tongue protrusion (grasping with lips and reaching with tongue; grasping). Ferrari et al. (2003, p. 1713) state that “the knowledge common to the communicator and the recipient of communication about food and ingestive action became the common ground for social communication. Ingestive actions are the basis on which communication is built” (my italics). Thus their strong claim “Ingestive actions are the basis on which communication is built” might better be reduced to “Ingestive actions are the basis on which communication about feeding is built”, which complements but does not replace communication about manual skills. And certainly, manual skills support pantomime in a way which facial contortions cannot.
Conclusions

I have argued that it was the discovery in protosign that arbitrary gestures could be combined to convey novel meanings that provided the essential scaffolding for the transition from the closed set of primate vocalizations to the limited openness of protospeech. Association of vocalization with manual gestures allowed them to assume a more open referential character, and to exploit the capacity for imitation of the underlying brachio-manual system. The claim is not that no meaning can evolve within the oro-facial domain, but only that the range of such meanings is impoverished compared with those expressible by pantomime. I claim that it is easy to share a wide range of meanings once the mirror system for grasping evolves in such a way as to support pantomime, and that the need for disambiguation then creates within a community a shared awareness of the use of conventional gestures as well as iconic gestures — whereas onomatopoeia seems to be far more limited in what can be conveyed. I thus hypothesize that Homo habilis and even more so Homo erectus had a “proto-Broca’s area” based on an F5-like precursor mediating communication by manual and oro-facial gesture. This made possible a process whereby this “proto” Broca’s area gained primitive control of the vocal machinery, thus yielding increased skill and openness in vocalization, moving from the fixed repertoire of primate vocalizations to the unlimited (open) range of vocalizations exploited in speech. Speech apparatus and brain regions could then co-evolve to yield the configuration seen in modern Homo sapiens. Fogassi (personal communication) casts doubt on the evolutionary primacy of hand gestures, noting that in monkeys hand and mouth gestures are strongly related and, when the same neurons code both
effectors, they often convey more abstract meaning than neurons coding separately hand or mouth. However, the essence of my argument is not that Stage S5:

S5: Protosign, a manual-based communication system, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire;

must have primacy over Stage S6:

S6: Protospeech, resulting from the ability of control mechanisms evolved for protosign coming to control the vocal apparatus with increasing flexibility;

but only that protospeech had to be integrated with protosign to achieve a situation in which a wide set of meanings were ripe for vocal expression, thus providing the selective pressure for the emergence of speech.

This approach is motivated in great part by the multimodal features of real human spoken language communication, as opposed to the idealized view which equates language with the spoken form of written language. McNeill (1992) has used videotape analysis to show the crucial use that people make of gestures synchronized with speech. Pizzuto, Capobianco, and Devescovi (2005) stress the interaction of vocalization and gesture in early language development (see also Iverson & Goldin-Meadow, 1998). Moreover, multimodal links are quite striking in signed language communication, where a wealth of “vocal” gestures are observed. These have been explored much more in research on European signed languages than in US research on ASL, and there is little ground to believe that these oral gestures in signs are primarily due to the influence of “oral education” on deaf people (Elena Pizzuto, personal communication8). A “speech only” evolutionary hypothesis leaves mysterious the availability of this vocal-manual-facial complex, which supports not only a limited gestural accompaniment to speech but also the ease of acquisition of signed languages for those enveloped within it in infancy.

However, the “protosign scaffolding” hypothesis has the problem of explaining why speech became favored over gestural communication. Standard arguments (see, e.g., Corballis, 2002) include the suggestions that, unlike speech, signing is not omni-directional, does not work in the dark, and does not leave the hands free. But does the advantage of communication in the dark really outweigh the choking hazard of speech? Karen Emmorey (personal communication) notes that signers are perfectly good at signing while showing someone how to use a tool, and sign might actually be better than speech when the tool is not present. Moreover, Falk (2004; but see also MacNeilage & Davis, 2005) argues that vocal “motherese” in early hominins served to bond parent and child and other kin and thus could provide a direct precursor to speech.

To address this, it may help to consider a hypothetical story concerning the emergence of protospeech on a protosign scaffolding. When someone bites into a
piece of sour fruit by mistake, they make a characteristic face and intake of breath on tasting the sour fruit. This could have been the basis for the innovation of using a conventionalized variant of this reaction as the symbol for “sour” in a tribe. Note that this symbol for “sour” is a vocal-facial gesture, not a manual gesture, and thus could contribute to a process whereby protolanguage began its transition from the predominance of protosign to the predominance of protospeech by at first using protospeech symbols to enrich protosign utterances, whereafter an increasing number of symbols would have become vocalized, freeing the hands to engage flexibly in both praxis and communication as desired by the “speaker”. Similarly, the ability to create novel sounds to match degrees of freedom of manual gestures (for example, rising pitch might represent an upward movement of the hand), coupled with a co-evolved ability to imitate novel sound patterns (with onomatopoeia yielding vocal gestures not linked to manual gestures), could have helped create the early vocal repertoire. However, articulatory gestures alone do not have the rich ad hoc communicative potential that pantomime provides with the manual gesture system.

This still leaves open the question “If protosign was so successful, why did spoken languages come to predominate over signed languages?” As in much evolutionary discussion, the answer must be post hoc. One can certainly imagine a mutation which led to a race of deaf humans who nonetheless prospered mightily as they built cultures and societies on the rich adaptive basis of signed languages. So the argument is not that speech must triumph, any more than that having a mirror system must lead to (proto)language. Let me just note that the doctrine of the expanding spiral (H3) is far more hospitable to the eventual answer to this question than is (H2), the view that protosign yielded full signed languages that preceded the emergence of speech. For, indeed, if hominid protolanguage combined protosign and protospeech, there is no issue of speech displacing a fully successful system of signed language.

Having said this, I think it may be mistaken to expect the question of why speech predominates over signing in human evolution to have a quasi-biological answer. Instead, the answer may actually lie in the vagaries of human history. It is a historical fact that humans have evolved writing systems that may be ideographic (Chinese), based on a syllabary (Japanese kana), or alphabetic (as in English). It is also a historical fact that alphabetic systems are in the ascendancy and that the Chinese move towards an alphabetic system (pinyin) was probably reversed only because word processing removed the problems posed by trying to make efficient use of typewriters for a stock of thousands of characters. One advantage of writing (of course!) is that it leaves a written record that can be used in teasing apart the historical ebb and flow in the development and propagation of different writing systems (see, e.g., Coulmas, 2003). No such record is available for the ebb and flow
of speech and signing. We do know that some aboriginal Australian tribes (Kendon, 1988) and some native populations in North America (Farnell, 1995) use both signing and speech; and Kendon has republished the classic text on Neapolitan signing (de Jorio, 2000). As a result, I am not convinced that a case need be made for driving “the spiral toward speech” — it is enough to have shown how protosign created a link to the use of arbitrary gestures to convey meaning as a scaffolding to creating such a link to protospeech, and that thereafter advances in either system could create a space for advances in the other.

To summarize: I have argued for the following scheme (Figure 1),

S4: A complex imitation system for manual and orofacial gestures
S5: Protosign, a manual-based communication system, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire. | S6: Protospeech, resulting from the ability of control mechanisms evolved for protosign coming to control the vocal apparatus with increasing flexibility.
S7: Language in a multi-modal vocal-manual-facial domain
where I have placed S5 and S6 on the same line of the table to emphasize that the issue is not the primacy of S5 but rather the necessity of protosign in supporting the development of semantic forms rich enough to drive the full emergence of protospeech. And I have argued against the “vocal only” scheme:

A complex imitation system for vocalization
Protospeech, in the sense of mechanisms to control the vocal apparatus with increasing flexibility.
Language in the vocal domain
Note that this argument says nothing against the importance of understanding the evolution of a vocal apparatus (including its neural control) adequate to the demands of speech. However, my contention is that the doctrine of the expanding spiral can provide an approach to that evolution that is richer than the quest for a vocal-only account such as that espoused by MacNeilage (1998; MacNeilage & Davis, 2005).
Acknowledgements

Preparation of the present paper was supported in part by a Fellowship from the Center for Interdisciplinary Research of the University of Southern California. My thanks to Oana Benga, David Caplan, Barbara Davis, Karen Emmorey, Leo Fogassi, Peter MacNeilage and Elena Pizzuto for their comments on an earlier draft of the article.
Notes

1. Corballis (2002, 2004) has stated positions that at times read like an argument for H2, though it appears (personal communication) that he would agree with H3.

2. They also review data on auditory linkages which will be discussed later.

3. Since we will be concerned in what follows with signed as well as spoken language, I ask the reader to note that “speaker” and “hearer” may actually be using hand and face gestures rather than, or in addition to, vocal gestures for communication.

4. For further data contrary to my argument on the role of manual gesture in the evolution of speech, see the commentaries on Arbib (2005) by Barrett, Foundas & Heilman; Bosman, López, & Aboitiz; Emmorey; MacNeilage & Davis; Rauschecker; and Seyfarth. On the other hand, McNeill, Bertenthal, Cole & Gallagher show how strongly manual gesture is integrated into the speech of humans, thus weakening claims for a speech-only view of language.

5. Benga (2005) argues that the evolution of vocal speech involved a shift in control from anterior cingulate cortex to Broca’s area in order to include vocal elements in intentional communication. She favors the strategic view of the involvement of the anterior cingulate cortex in selection for action, suppression of automatic/routine behaviors, and error correction; while noting that some researchers hold that the anterior cingulate cortex has only the evaluative functions of error detection and conflict monitoring.

6. To consider the metaphor further: It is not necessary to complete the scaffolding before one begins to construct the walls. Rather one may construct the scaffolding for each floor as the preceding floor is completed. In similar vein, protosign, it is claimed, provided essential support for key elements of the construction of protospeech. However, it may not be fruitful to develop the metaphor too literally if it is taken to imply that all trace of the scaffolding is removed once the building is completed. Rather, I argue that the human brain evolved to support protosign as well as protospeech, that it is a historical fact that spoken languages have predominated over sign languages, but that the brain mechanisms that support human language are not specialized for speech but rather support communication in an integrated manual-facial-vocal multi-modal fashion.

7. Modern sign languages are fully expressive human languages, and so must not be confused with protosign — we use signing here to denote manually based linguistic communication, as distinct from the reduced use of manual gesture that is the frequent accompaniment of speech.

8. See Braem and Sutton-Spence (2001) for evidence on the widespread use of mouthing and mouth gestures, with and without sound, in several European and one Asian signed language, and the review by Pizzuto (2003a). See also Section 3 of Pizzuto (2003b) for more on multimodal features in signed communication.
References

Arbib, M. A. (2002). The mirror system, imitation, and the evolution of language. In Nehaniv, C. & Dautenhahn, K. (Eds.), Imitation in animals and artifacts (pp. 229–280). Cambridge, MA: MIT Press.

Arbib, M. A. (2004). How far is language beyond our grasp? A response to Hurford. In Oller, D. K. & Griebel, U. (Eds.), Evolution of communication systems: A comparative approach (pp. 315–321). Cambridge, MA: MIT Press.

Arbib, M. A. (2005). From monkey-like action recognition to human language: An evolutionary framework for neurolinguistics. Behavioral and Brain Sciences, 28, 105–167.

Arbib, M. A., & Bota, M. (2003). Language evolution: Neural homologies and neuroinformatics. Neural Networks, 16, 1237–1260.

Arbib, M. A., & Rizzolatti, G. (1997). Neural expectations: A possible evolutionary path from manual skills to language. Communication and Cognition, 29, 393–424.

Benga, O. (2005). Intentional communication and the anterior cingulate cortex. Interaction Studies: Social Behaviour and Communication in Biological and Artificial Systems, 6(2), 201–221. (This volume.)

Benson, D. F., & Ardila, A. (1996). Aphasia: A clinical perspective. New York/Oxford: Oxford University Press.

Braem, P. B., & Sutton-Spence, R. (Eds.) (2001). The hands are the head of the mouth — The mouth as articulator in sign languages. Hamburg: Signum.

Byrd, D., & Saltzman, E. (2003). Speech production. In Arbib, M. A. (Ed.), The handbook of brain theory and neural networks, Second Edition (pp. 1072–1076). Cambridge, MA: A Bradford Book/MIT Press.

Corballis, M. C. (2002). From hand to mouth: The origins of language. Princeton: Princeton University Press.

Corballis, M. C. (2004). The origins of modernity: Was autonomous speech the critical factor? Psychological Review, 111(2), 543–552.

Corina, D. P., Poizner, H., Bellugi, U., Feinberg, T., Dowd, D., & O’Grady-Batch, L. (1992). Dissociation between linguistic and nonlinguistic gestural systems: A case for compositionality. Brain and Language, 43(3), 414–447.

Coulmas, F. (2003). Writing systems: An introduction to their linguistic analysis. Cambridge: Cambridge University Press.

de Jorio, A. (2000). Gesture in Naples and gesture in classical antiquity. A translation of La mimica degli antichi investigata nel gestire napoletano (“Gestural expression of the ancients in the light of Neapolitan gesturing”), with an introduction and notes by Adam Kendon. Bloomington: Indiana University Press.

Donald, M. (1994). Origins of the modern mind. Cambridge, MA: Harvard University Press.

Donald, M. (1999). Preconditions for the evolution of protolanguages. In Corballis, M. C. & Lea, S. E. G. (Eds.), The descent of mind (pp. 138–154). Oxford: Oxford University Press.

Falk, D. (2004). Prelinguistic evolution in early hominins: Whence motherese? Behavioral and Brain Sciences, 27(4), 491–503.

Farnell, B. (1995). Do you see what I mean? Plains Indian sign talk and the embodiment of action. Austin: University of Texas Press.
Ferrari, P. F., Gallese, V., Rizzolatti, G., & Fogassi, L. (2003). Mirror neurons responding to the observation of ingestive and communicative mouth actions in the monkey ventral premotor cortex. European Journal of Neuroscience, 17, 1703–1714.

Fogassi, L., & Ferrari, P. F. (2004). Mirror neurons, gestures and language evolution. Interaction Studies: Social Behaviour and Communication in Biological and Artificial Systems, 5(3), 345–364.

Ghazanfar, A. A. (Ed.) (2003). Primate audition: Ethology and neurobiology. Boca Raton: CRC Press.

Hurford, J. H. (2004). Language beyond our grasp: What mirror neurons can, and cannot, do for language evolution. In Oller, D. K. & Griebel, U. (Eds.), Evolution of communication systems: A comparative approach (pp. 297–313). Cambridge, MA: MIT Press.

Iverson, J. M., & Goldin-Meadow, S. (Eds.) (1998). The nature and function of gesture in children’s communication. New Directions for Child Development, 79.

Jürgens, U. (2002). Neural pathways underlying vocal control. Neuroscience and Biobehavioral Reviews, 26, 235–258.

Jusczyk, P. W., Goodman, M. B., & Baumann, A. (1999). Nine-month-olds’ attention to sound similarities in syllables. Journal of Memory and Language, 40(1), 62–82.

Kendon, A. (1988). Sign languages of aboriginal Australia: Cultural, semiotic, and communicative perspectives. Cambridge: Cambridge University Press.

Kohler, E., Keysers, C., Umiltà, M. A., Fogassi, L., Gallese, V., & Rizzolatti, G. (2002). Hearing sounds, understanding actions: Action representation in mirror neurons. Science, 297, 846–848.

Ladefoged, P. (2001). A course in phonetics, Fourth Edition. Orlando: Harcourt College Publishers.

MacNeilage, P. F. (1998). The frame/content theory of evolution of speech production. Behavioral and Brain Sciences, 21, 499–546.

MacNeilage, P. F., & Davis, B. L. (2005). The frame/content theory of evolution of speech: A comparison with a gestural origins alternative. Interaction Studies: Social Behaviour and Communication in Biological and Artificial Systems, 6(2), 173–199. (This volume.)

MacNeilage, P. F., & Davis, B. L. (2006). Baby talk and the emergence of first words. Commentary on Falk, D., Prelinguistic evolution in early hominins: Whence motherese? Behavioral and Brain Sciences, 27(4), 517–518.

McNeill, D. (1992). Hand and mind — What gestures reveal about thought. Chicago: University of Chicago Press.

Oztop, E., & Arbib, M. A. (2002). Schema design and implementation of the grasp-related mirror neuron system. Biological Cybernetics, 87, 116–140.

Oztop, E., Bradley, N. S., & Arbib, M. A. (2004). Infant grasp learning: A computational model. Experimental Brain Research, 158, 480–503.

Pizzuto, E. (2003a). Review of The hands are the head of the mouth — The mouth as articulator in sign languages, Penny Boyes Braem and Rachel Sutton-Spence (Eds.). Sign Language & Linguistics, 6, 300–305.

Pizzuto, E. (2003b). Coarticolazione e multimodalità nelle lingue dei segni: Dati e prospettive di ricerca dallo studio della lingua dei segni italiana (LIS) [Coarticulation and multimodality in signed languages: Data and research perspectives from the study of Italian Sign Language (LIS)]. In Marotta, G. & Nocchi, N. (Eds.), La Coarticolazione — Atti delle XIII Giornate di Studio del Gruppo di Fonetica Sperimentale (A.I.A.) (pp. 59–77). Pisa: Edizioni ETS.
Pizzuto, E., Capobianco, M., & Devescovi, A. (2005). Gestural-vocal deixis and representational skills in early language development. Interaction Studies: Social Behaviour and Communication in Biological and Artificial Systems, 6(2), 223–252. (This volume.)
Rizzolatti, G., & Arbib, M. A. (1998). Language within our grasp. Trends in Neurosciences, 21(5), 188–194.
Savage-Rumbaugh, S., Shanker, S. G., & Taylor, T. J. (1998). Apes, language, and the human mind. New York: Oxford University Press.
Schaal, S., Sternad, D., Osu, R., & Kawato, M. (2004). Rhythmic arm movement is not discrete. Nature Neuroscience, 7, 1136–1143.
Stokoe, W. C. (2001). Language in hand: Why sign came before speech. Washington, DC: Gallaudet University Press.
Tomasello, M. (1999). The human adaptation for culture. Annual Review of Anthropology, 28, 509–529.
Umiltà, M. A., Kohler, E., Gallese, V., Fogassi, L., Fadiga, L., Keysers, C., & Rizzolatti, G. (2001). I know what you are doing: A neurophysiological study. Neuron, 31(1), 155–165.
Wray, A. (1998). Protolanguage as a holistic system for social interaction. Language & Communication, 18, 47–67.
About the author Michael A. Arbib is the Fletcher Jones Professor of Computer Science, as well as a Professor of Biological Sciences, Biomedical Engineering, Electrical Engineering, Neuroscience and Psychology at the University of Southern California. His current research focuses on neural mechanisms underlying the coordination of perception and action and their implications for the evolution of human language. He is also active in neuroinformatics and in extracting lessons from the analysis of vertebrate brain organization for the design of novel computer architectures integrating learning, cooperative computation and perceptual robotics.
The Frame/Content theory of evolution of speech
A comparison with a gestural-origins alternative

Peter F. MacNeilage and Barbara L. Davis
The University of Texas at Austin
The Frame/Content theory deals with how and why the first language evolved the present-day speech mode of programming syllable "Frame" structures with segmental (consonant and vowel) "Content" elements. The first words are considered, for biomechanical reasons, to have had the simple syllable frame structures of pre-speech babbling (e.g., "bababa"), and were perhaps parental terms, generated within the parent–infant dyad. Although all gestural-origins theories (including Arbib's theory reviewed here) have iconicity as a plausible alternative hypothesis for the origin of the meaning-signal link for words, they all share the problem of explaining how and why a fully fledged sign language, necessarily involving a structured phonology, changed to a spoken language.

Keywords: speech, sign language, evolution, words, syllables
Introduction What is the history of medium use in the evolution of language? One proposal is that the vocal-auditory channel, in universal use in extended cultures today, was the original medium. Another proposal suggests that the visual-gestural medium, used today only by profoundly deaf individuals, was the original medium for communication and was later replaced by communication in the vocal-auditory channel. The Frame/Content (F/C) theory (see especially MacNeilage, 1998; MacNeilage & Davis, 2000; 2001) is the only relatively comprehensive theory focused on how the vocal-auditory medium of language evolved. While Lieberman’s (1984) two-tube vocal-tract theory addressed the question of how the vocal-auditory medium became sufficiently versatile for speech, it did not speak to the more central question of how the organization of speech evolved. Here we will outline our F/C theory of
evolution of the organization of speech and compare it with the gestural-origins theory of Arbib (Arbib & Rizzolatti, 1997; Rizzolatti & Arbib, 1998, and this volume). A basic perspective underlying our work is one of embodiment — derivation of phonological structure from the structures and functions of the parts of the body that produce speech. The history of the vocal apparatus and its use are central to the construction of the necessary Darwinian "Descent with Modification" scenario (Darwin, 1859, p. 420) for evolution of speech forms. Successful use of the apparatus to respond to selection pressures for effective social communication is surely what led to speech as it is today. How do we propose to derive phonological form from consideration of the structure-function relationships of the oral apparatus? Consider one issue. Speech is composed of syllables, and the canonical form of syllables involves a relatively rhythmic alternation between a closed position of the mouth for consonants and an open position for vowels. The consonant-vowel syllable is the only universal syllable type (Bell & Hooper, 1978). The main structure responsible for this mouth close-open alternation is the mandible (MacNeilage & Davis, 1990). The mandible makes radial movements to open and close the mouth, rotating about its axis on the cranium, as shown in Figure 1. In our pursuit of a classical descent-with-modification perspective on the structure and function of the speech apparatus, a central question becomes "What was the origin and the history of use of the mandible which culminated in its role of master controller of the syllable?"
Figure 1. Schematic view of the mandible illustrating its rotation about its axis on the cranium. Mandibular elevation (a closed mouth) is associated with consonants, and mandibular depression (an open mouth) is associated with vowels.
Now let us put the question of the history of medium use for language in a more general perspective. Of considerable importance to any comprehensive proposal of a functional-origins scenario for the phonological component of modern language, as seen from a descent-with-modification perspective, is a sense of logical continuity. According to Wray (2003), tracing a common thread can properly include three stages. First come precursor behaviors, which may not be directly or explicitly related to language, but which contribute to the creation of an infrastructure for more complex developments. For us, mandibular oscillation is one such precursor behavior. Arbib (this volume) talks of this stage in terms of evolution of a "language ready" brain. Next in the evolutionary process, more proximal triggers become apparent, and their connection to language form is more transparent. Finally, in shallow time, evolutionary scenarios involve refining the language system, such that the particular shape of modern language begins to emerge in response to more precisely language-based pressures on message-transmission capabilities. Let us consider in more detail the precursor structure and function that we have already identified. The mandible originally evolved as a modification of a pair of gill arches in response to selection pressures for efficient food capture in jawed fishes, circa 400–450 million years ago (Radinsky, 1987). Presumably the biphasic mandibular cycle was a subsequent modification of this original function of food capture. The modification apparently originated in paleomammals (circa 200 million years ago) in response to selection pressures for more efficient food processing necessitated by the change to internal temperature regulation in mammals relative to their cold-blooded reptilian ancestors (Radinsky, 1987). Not only was such an alternation an essential part of the feeding adaptation whereby mammalian infants gained nutrition from their mothers via sucking, but it also lay behind two other evolving feeding adaptations, namely, chewing and licking. Another important precursor behavior to speech, in our opinion, was more immediate: the evolution of a general-purpose mimetic capacity that, according to Donald (1994; 1999), may have arisen in Homo erectus. This important advance made possible the virtually unlimited modern abilities to perform various sociocultural actions such as sports, games, singing, ballet, and opera. Donald's argument is that mimesis originally evolved as a response to selection pressures for group solidarity, originally manifest in group rituals and dances. He suggests that mimesis provided necessary cognitive-motor preconditions for language without which a versatile phonological component of speech would not have been possible. His proposal is consistent with many present perspectives suggesting that domain-general cognitive skills have been harnessed late in evolution for new functions related to support of language use (e.g., Tomasello, 2003).
In the second, proximal trigger phase, where the connection to language form should become more apparent, we will argue that our Frame/Content vocal-origins scenario, centering on a new role for mandibular oscillation, allows a direct linkage to modern language form without the necessity of the massive translation process from manual to vocal output required by gestural-origins theories — a translation from a set of bodily structures (shoulders, arms, hands, and fingers) supporting gestural communication to the vastly different set of bodily structures supporting vocal communication (jaw, tongue, velum, etc.). These diverse structural complexes are not easily overlaid on one another, either at the physiological level of their interaction in action sequences or at the neural level. In our view, such scenarios have insufficient regard for the structural properties of the two output systems and the constraints that the structures place on function. In this regard, it should be noted that such a massive gestural-to-vocal change would involve not only change in the production aspect of language form but an equally daunting need for change in the perception channel. The efficient link between production and perception channels is another basic requirement to be considered in evolution toward a modern language system. That is, such a system requires not only rapid encoding but also rapid decoding. It is not parsimonious or, in our view, necessary to suggest that the evolutionary progression toward modern language required a switch in the sender-to-receiver component from a visual to an auditory mode. Another aspect of this proximal trigger phase required by adherence to a neo-Darwinian scenario is the necessity of identifying selection pressures (social communication pressures, in this case) providing an impetus for adaptive responses. Our proposal incorporates a dual hypothesis related to these social/environmental pressures. First, dyadic communication between infant and caregiver provides a powerful proximal trigger for vocal communication, given the necessity of infant–parent social-emotional bonding. This bonding is an adaptive response to selection pressures toward survival of offspring. Vocalizations would naturally be based on vocal forms already available from non-speech-related capacities, including those founded in rhythmic mandibular oscillation. We will argue later that the caregiver–infant dyad has both ecological structure and, according to F/C theory, the phonetic underpinnings that allowed the advent of the pairing of sounds with meanings to form the first words. We propose that these pairings could have arisen functionally as labels for the intimate partner in multiple caregiver rituals. The act of spontaneously linking sounds with contextually appropriate meanings could eventually lead to the "Nominal Insight" (McShane, 1979) — the insight that things can be assigned spoken names.
A second crucial social pressure driving an adaptive response toward use of the vocal channel can be postulated at the level of organization among speakers and listeners. Building on Donald's (1994; 1999) proposal for an early mimetic capacity whereby humans in social groups engaged in group-level matching of action patterns in the service of cohesiveness, we would argue that the vocal capacities we are sketching for early dyadic communication would plausibly have been present and useful at the level of the community. Pragmatic functions such as requesting, showing, or negating could logically have begun to rely on the co-opting of nonspeech capacities for vocal communication between adult community members as well as between adults and infants. These vocal signals would have functioned to support crucial community interactions for survival related to procreation, protective responses to predators, or locating and managing food sources. The later stages of evolutionary change, wherein the precise shape of modern languages begins to emerge in the process of refining the language system, are also important to proposing a comprehensive perspective on the evolution of modern language form. In this phase, the pressure to communicate more elaborate messages based on increasing sophistication in community social organization would drive the need for diversification of vocal capacities. Here the capacity for production of serially ordered vocalizations based on mandibular oscillation would provide a substrate for selection of an increasingly diverse set of sound-meaning links. We propose that the expansion of capacity from the rhythmic mandibular frame cycle toward vocal complexity, enabled by increasing independence of the articulators from the jaw cycle, offers the opportunity to trace a precise path toward modern phonology.
F/C theory: Relations between phylogeny and ontogeny

The Frame/Content (F/C) theory shares a relatively common conviction (e.g., Jackendoff, 2002) that the inaugural step for true language required the evolution of a combinatorial phonology whereby members of a relatively small set of meaningless units are concatenated to form patterns for a large number of words. The resultant system can be described as "open" in that it can be added to, more or less indefinitely. Studdert-Kennedy & Lane (1980) have argued that this development was necessary to solve what they called the "impedance-matching" problem involving the sending of a large number of distinguishable words via a communication medium. They argued that as the number of holistic communicative signals increased, the point would soon be reached where reliable distinctions between similar signals would not be possible. The main question addressed by F/C theory
is how the vocal-auditory modality solved the problem of developing a combinatorial phonology whereby consonants and vowels are concatenated to form an indefinite number of words. According to F/C theory, four steps are required, three of which preceded the ultimate open state. First was evolution of the mouth close-open alternation, a precursor to the canonical consonant-vowel (CV) syllable of language. In our proposal, this alternation evolved with paleomammals (circa 200 million years ago) in the form of cyclical mandibular movements associated with ingestion (e.g., chewing). Next, the rhythmic cycle was exapted for visuofacial communication in the form of lipsmacks, tonguesmacks, and teeth chatters. These forms are commonly observable in nonhuman primates (Redican, 1975; Van Hooff, 1962; 1967). Each involves mandibular oscillation, and most likely shares a single rhythm generator at the neural level. Both of these initial steps properly belong to the phase of precursor behaviors, as they are not explicitly related to language. Subsequently, the action of this common rhythm generator was paired with phonation to form protosyllabic "Frames" for speech (e.g., "bababa"). This event may have been induced in the context of developments suggested by Dunbar (1996). He begins by noting that manual grooming is an important social bonding device in many of our primate relatives. He argues that as group size increased in hominid evolution, maintaining alliances at a hands-on level became impractical, and consequently manual grooming was replaced by vocal grooming. The eventual outcome of this transition was the evolution of gossip as a means of social information exchange. Lipsmacks are an important communicative accompaniment to manual grooming (Redican, 1975) and may have evolved their communicative function after becoming actions anticipatory to putting particles gleaned in the grooming process into the mouth (Van Hooff, 1962; 1967). The addition of phonation to these mandibular cyclicities, yielding protosyllabic sequences, could have been a facilitator of the transition to vocal grooming as suggested by Dunbar. This step is related to the proximal trigger phase, where the connection to language form is logically apparent. Finally, the frame became programmable with individual consonants and vowels ("Content" elements), making available forms such as "bodega." This phase, one of refining the language system, occurred in the period in which the precise shape of modern languages begins to emerge in evolution. The existence of this final Frame/Content stage is indicated in contemporary speakers by serial-order errors of speech such as spoonerisms. In these errors, individual consonants and vowels exchange position while leaving the original syllable structure (Frame) intact (e.g., "sommon kense" for "common sense").
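The independence of frame and content can be made concrete with a small illustration. The following Python sketch is ours, not part of the theory's formal apparatus, and it uses orthographic letters as crude stand-ins for phonemic segments; it shows how a spoonerism can exchange segmental content while leaving each word's CV skeleton untouched:

    # A toy model: an utterance is a frame (C/V skeleton) plus separately
    # stored content (the segments themselves).
    def parse(word, vowels="aeiou"):
        # Derive the frame by classifying each segment as consonant or vowel.
        frame = "".join("V" if ch in vowels else "C" for ch in word)
        return frame, list(word)

    def exchange_onsets(word1, word2):
        # A spoonerism: swap the two initial consonants (content only).
        f1, c1 = parse(word1)
        f2, c2 = parse(word2)
        c1[0], c2[0] = c2[0], c1[0]
        return "".join(c1), "".join(c2), f1, f2

    w1, w2, f1, f2 = exchange_onsets("common", "sense")
    print(w1, w2)   # -> sommon cense (cf. "sommon kense")
    print(f1, f2)   # -> CVCCVC CVCCV: both frames are unchanged

Only the content slots move; the frames, like the syllable structures in real exchange errors, stay put.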
The presence of a frame/content mode of organization of modern speech is not controversial. As Levelt (1992) has said: "Probably the most fundamental insight from modern speech error research is that a word's skeleton or frame and its segmental content are independently generated" (p. 10). Our present view is that the independent segmental "Content" component gradually evolved, just as it seems to gradually develop in speech acquisition (MacNeilage & Davis, 1990). The proposal that the CV sequence is the prototypical syllable form is motivated by the fact that the CV form is the only universal syllable form in languages (Maddieson, 1999), and that the form is dominant in the babbling and early speech of infants (Davis & MacNeilage, 1995; Davis, MacNeilage, & Matyear, 2002). Beyond this general level, the theory is based on a more detailed comparison of the serial organization of infant speech-related output and serial-organization patterns of modern languages. We initiated our program of work on speech ontogeny in the hope that it would illuminate the stages of speech phylogeny. Our conclusion from this body of data evaluating our predictions is that patterns that are held in common between infants and modern languages are fundamental to speech (primarily for basic biomechanical reasons). Patterns that differ between infants and modern languages provide evidence for how speech evolved from simple patterns to the more complex patterns employed by contemporary speakers. F/C theory has no implications for the process by which single-word production evolved into syntax. In this respect, our theory differs little from Arbib's. Although he suggests that single words might have been differentiated out of manual gestures that convey syntactic information, he does not include predictions related to the evolution of the syntactic capacities of modern speakers. Our research on speech acquisition has been reported in a large number of publications in the last decade or so (e.g., Davis & MacNeilage, 1995; Davis, MacNeilage, & Matyear, 2002; Gildersleeve-Neumann, 1998; Matyear, MacNeilage, & Davis, 1997; MacNeilage & Davis, 1990, 1993; Redford, MacNeilage, & Davis, 1997). A major finding is that three particular patterns of consonant-vowel (CV) co-occurrences, illustrated in Figure 2, are most prominent in babbling and early words. They are: (1) coronal (tongue front) consonants tend to co-occur with relatively high front vowels; (2) dorsal (tongue back) consonants tend to co-occur with relatively high back vowels; (3) labial (lip) consonants tend to co-occur with central vowels. Analysis of language patterns (MacNeilage et al., 2000a), extended by others (Rousset, 2003), has shown that modern languages also tend to have these three patterns, though not universally (73% of instances in our study of 10 languages; MacNeilage et al., 2000a). These patterns have a biomechanical basis in inertia. As the inertia of these structures has been inherent in them since their origin, the implications of this finding are fundamental to the evolution of speech.
Figure 2. A schematic view of the articulatory components of the speech apparatus, in which the three arrows symbolize the three CV co-occurrence patterns. (The underlined vowel in "dada" is the vowel in "dad".) (From Science, with permission.)
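To make the analysis behind such co-occurrence figures concrete, here is a small sketch of our own (the syllable data are invented for illustration; this is not the authors' actual procedure or corpus). It computes observed-to-expected co-occurrence ratios for consonant place and vowel class; a ratio above 1.0 means the pairing occurs more often than chance would predict:

    from collections import Counter

    # Hypothetical transcribed CV syllables, coded as (consonant place, vowel class).
    syllables = [("coronal", "front"), ("coronal", "front"), ("labial", "central"),
                 ("labial", "central"), ("dorsal", "back"), ("coronal", "central"),
                 ("labial", "front"), ("dorsal", "back"), ("coronal", "front"),
                 ("labial", "central")]

    n = len(syllables)
    pairs = Counter(syllables)
    cons = Counter(c for c, _ in syllables)
    vows = Counter(v for _, v in syllables)

    for (c, v), observed in sorted(pairs.items()):
        # Expected count if consonant place and vowel class were independent.
        expected = cons[c] * vows[v] / n
        print(f"{c}+{v}: observed={observed}, expected={expected:.2f}, "
              f"ratio={observed / expected:.2f}")

On such an analysis, the three predicted pairings (coronal-front, dorsal-back, labial-central) should show ratios above 1.0, and the off-diagonal pairings ratios below 1.0.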
For the coronal and dorsal patterns (1 and 2), the tongue tends to remain in the front and the back of the mouth, respectively, for the syllable. For the labial pattern (3), tongue positioning is not required for the consonant, and therefore the tongue apparently remains more or less stationary here too (in its rest position) for the syllable. For babbling and early words, the infant tends to repeat the same CV sequence (e.g., "bababa" or "mamama"). We have shown that inertia limits tongue movement from the consonant to the vowel. As the same consonant follows the vowel in these sequences, the same inertial effect from vowel to consonant as from consonant to vowel is predicted. Accordingly, we have found that the three CV co-occurrence constraints are accompanied by three corresponding VC co-occurrence constraints. Thus the biomechanical constraint against tongue-position change tends to pervade the entire utterance in early speech acquisition. Modern languages do not favor repetitions of the same syllable. Instead they have a strong constraint against it (described by generative phonologists as the "Obligatory Contour Principle"; Kenstowicz, 1994). To understand how this profound change might have occurred, one should first note that the repetitive CV forms of infants should not be regarded as syllable sequences in psycholinguistic terms, for one important reason. Successive CVs are presumably not under
separate control by the infant, beyond the ability to immediately produce another instance of the same one, following the previous one. This can be shown by first singling out, in individual infants, a frequently occurring labial-central CV form (e.g., "bababa") and a popular coronal-front CV form (e.g., "deedeedee"). These sequences can be considered forms over which the infant has good control. However, virtually no "badee" or "deeba" forms are found, suggesting that at the level of serial organization infants have virtually no ability to control CV forms as independent building blocks of output. Consequently, CV forms in babbling and early speech should be regarded as "protosyllables" in the sense that they are an initial form that gives rise to syllables in development, just as such protosyllabic forms probably gave rise to syllables in evolution. In contrast to the situation in infants, modern language production by adults, according to Levelt (1999), involves a "syllabary" consisting of several hundred independent syllabic units available for forming multisyllabic sequences. For an indication of how this radical change in organization might have occurred, let us further consider co-occurrence patterns between adjacent segments in infants and in adult languages. Note again that infants have both CV and VC co-occurrence patterns. It is generally conceded that the most frequent syllable boundary in languages is between the vowel and the following consonant. Our research on modern language patterns (MacNeilage et al., 2000a, replicated by Rousset, 2003) has shown that, in contrast to infants, the three VC co-occurrence constraints are not present in languages when there is a syllable boundary between the V and the C (e.g., in words like "bi/nary"; MacNeilage et al., 2000b). However, when there is not a syllable boundary between the V and the following C (e.g., in "bin/ding"), the same three co-occurrence patterns found in infant VC patterns are found in adult languages. Thus in the evolution of the structure of modern languages, the basic biomechanical constraint against tongue-position change has been overcome by adult speakers at syllable boundaries. We consider the presence of the biomechanical constraints existing between Vs and following Cs in infants — but not across V/C syllable boundaries in languages — to provide important evidence of how the syllable evolved as a separate control unit in the history of languages. A drive for such change could be found in sociocultural pressures for increasing the size of the communicable lexical message set. How did languages progress historically from the putative original pattern of same-syllable repetition to variegation of successive syllables? Here again, infant patterns offer a clue. The most favored non-repetitive pattern in early infant words (a pattern not present in babbling) is to begin a word with a labial consonant and then, after the vowel, add a coronal consonant (MacNeilage et al., 2000b) — e.g., "bado" for bottle. We have argued that this pattern is a self-organizational result of (primarily) the relative ease of labial consonant production interacting with
contingencies associated with output initiation in motor systems in general (MacNeilage & Davis, 2000). We have also found this pattern to be highly characteristic of modern languages (MacNeilage et al., 2000b, again replicated by Rousset, 2003). We postulate that the same relative ease of labial consonant production, interacting with contingencies associated with initiation of output, may have led our ancestors to produce more forms of this type as the productive lexicon for communication needed to grow for increasingly complex message transmission in more complex social circumstances.
The Origin of the Word

Up until now, our theory has been more limited than Arbib's, in that we have not considered the question of how signals might have first been paired with meanings. We have just begun to sketch out a scenario for the origin of sound-meaning pairings consistent with our contention that initial hominid words might have been similar to vocal forms observed in modern infants' babbling (MacNeilage & Davis, 2004a, b). This proposal is based on the frequent suggestion, recently reiterated by Falk (2004), that the first words may have originated in the linguistically marginal genre of Baby Talk. Falk begins by noting "the trend for enlarging brains in late australopithecines/early Homo progressively increased the difficulty of parturition, thus causing a selective shift towards females that gave birth to relatively undeveloped neonates." As these neotenous infants were not sufficiently mature to cling to their mothers, as great-ape infants are, mothers needed to put their babies down while foraging. In Falk's view, the resultant need for parental care at a distance created selection pressures for an elaboration of the dyadic vocal-communication pattern. In this context, Falk argues, partly on the basis of a suggestion by MacNeilage (2000), that a maternal term such as "Mama" might have emerged from the characteristic tendency, well documented in modern infants by Goldman (2001), for infants to produce nasal vocalizations as a demand signal (e.g., a whiny "m-m-mm-m-m-m"). While Falk accords the coining of the "mama" word to the baby, we believe the mother must have first cognitively designated it as a symbol for herself and then used it accordingly with other adults. This is consistent with Falk's observation that Baby Talk words for "mother" tend to contain nasal consonants (e.g., "m" or "n"). For example, we noted that all the words for the maternal parent in a corpus of Baby Talk words in six languages accumulated by Ferguson (1964) had a nasal consonant. However, importantly, no words for the male parent contained a nasal consonant. Perhaps parents produced the
first true linguistic contrast by adopting a convention of naming the male parent with an already available but distinctively different nonnasal frame series such as "papa" (or [daedae]). Conscious reflections on these pairings could logically have led to the "Nominal Insight" — the insight that things in general could be given names (McShane, 1979). Thus, the invention of the word! According to this view, the first words might have been parental terms formed in the Baby Talk context of parent–infant interaction. As Jakobson (1962) originally suggested, such words or variants may then have typically been incorporated into true languages, often in modified but not totally changed form. Consistent with this possibility, Murdock (1959) found in a corpus of 474 languages that 78% of first-syllable consonants in maternal parent terms were nasal, but only 32% of first-syllable consonants in paternal terms were. In addition, we found that these syllables exhibited the three CV co-occurrence constraints at levels comparable to those of modern infants and above the levels typical of modern languages in general (MacNeilage & Davis, 2004a). This would be expected if they had an origin in baby talk and played a fundamental role in the evolution of the capacity to make words.
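The kind of tally behind such figures is simple to reproduce. The sketch below is our own illustration (the mini-corpus is invented; it is not Murdock's or Ferguson's data): it computes the proportion of parental terms whose first consonant is nasal:

    NASALS = set("mn")

    maternal = ["mama", "nana", "ama", "mami", "eme"]   # invented illustrative terms
    paternal = ["papa", "baba", "dada", "tata", "apa"]  # invented illustrative terms

    def first_consonant(word, vowels="aeiou"):
        # First consonant letter of the word: a rough proxy for the
        # first-syllable consonant in this toy orthographic corpus.
        for ch in word:
            if ch not in vowels:
                return ch
        return None

    def nasal_rate(terms):
        firsts = [c for c in (first_consonant(t) for t in terms) if c is not None]
        return sum(c in NASALS for c in firsts) / len(firsts)

    print(f"maternal nasal rate: {nasal_rate(maternal):.0%}")  # -> 100%
    print(f"paternal nasal rate: {nasal_rate(paternal):.0%}")  # -> 0%

Run on real corpora of parental terms, the same tally yields the asymmetry reported above: high nasal rates for maternal terms and much lower rates for paternal terms.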
Ontogeny recapitulates phylogeny?

At this stage, let us explicitly consider a central proposition of F/C theory, namely, that in some respects speech ontogeny recapitulates phylogeny. Such a consideration is necessary because current dogma tends to encourage a reflexive condemnation of this principle (e.g., Medicus, 1992), which might lead many to discount our theory without even considering it. To begin with, it is common in phylogeny for systems to go from simple to more complex, and as Jacob (1977) pointed out, with greater complexity, history plays a bigger role, and that history tends to be reflected in developmental sequences. Von Baer's thesis that the ontogeny of bodily form involves recapitulation of a sequence of ancestral juvenile forms is widely accepted (see Gould, 1977). It is readily observable that speech ontogeny develops from simple to more complex. Beyond this, three of our findings lead to our espousal of the recapitulationist position. The first is that the mandibular cycle, which is a virtually exceptionless phenomenon in modern syllable production, and is always present in languages in its simple CV form (Bell & Hooper, 1978), occurs in the same simple form in the first truly speech-like vocalizations of infants. That this form tends to be highly rhythmic from its ontogenetic onset testifies to its deep-seated phylogenetic status. Further testimony to this status comes from (a) the likelihood that biphasic cycles
are the main way that animals get movement-related work done (MacNeilage, 1998), and (b) the fact that evolution tends to be conservative, favoring tinkering with already existing capabilities, including existing biphasic cycles, rather than inventing new capabilities from scratch (Jacob, 1977). The likelihood that mandibular cycles were involved in the first syllables and the fact that they are present in the first speech-like utterances of infants is one source of our recapitulationist stance. The second set of findings leading us to this stance is the ubiquitous presence in all speech-like behavior of one obviously biomechanical constraint on syllable production, namely, the constraint on the co-occurrence of consonants and following vowels — a constraint against transitional tongue movements. We have not yet found either an infant or a language that lacks all three of the consonant-vowel co-occurrence patterns we described earlier. Biomechanical inertia is a basic property of matter, and therefore of biological organisms. It is necessarily involved in changes in position in movement-control systems whenever and wherever these changes occur. If we find an aspect of an action system with an extremely ancient basic design that, in spite of its ultimately versatile modern use, is subject to a particular manifestation of inertia, we must conclude that it always has been subject to it. Here again, then, a property we confidently attribute to the first syllables of hominids is also present in the first syllable-like productions of infants. In making this claim for a ubiquitous role of inertia dating from the first phylogenetic and ontogenetic manifestations of speech-like behavior, we align ourselves with a long tradition in biology. The notion originated with the German school of “Naturphilosophie” (cf. Gould, 1977) and was further developed by Hertwig (e.g., Hertwig, 1901). According to this tradition, the laws of physics (in our case, biomechanics) and chemistry exert “Common Constraints” on the phylogeny and ontogeny of form. As we see it, physical properties of the production apparatus have inevitable consequences for function. (See Lock & Peters, 1996, pp. 373–4 for discussion.) The third set of findings applies not to the putative first mandibular oscillations of speech but to subsequent ones. The same biomechanical inertia operative in CV sequences ought, on simple biomechanical grounds, to be equally present when a speaker produces a VC sequence. This effect of inertia is observable in babbling (Davis & MacNeilage, 1995). But in languages it tends to be absent between a vowel and following consonant in cases in which there is a syllable boundary. We infer from this patterning that it took time for speakers to overcome this basic biomechanical constraint. Infants recapitulate that trajectory in exhibiting the VC constraint in serial vocal patterns in babbling (Davis & MacNeilage, 1995) and in early single words (Davis et al., 2002) but eventually overcoming it, as the language they are learning demands an increase in versatility and as users develop mature speech production skill.
In summary, we assert that the ontogeny of speech allows a window on the nature of its phylogeny. In both ontogeny and phylogeny, sound patterns are characterized by the same two-stage sequence. The first stage is/was one in which both are highly subject to basic constraints of biomechanical inertia. The second stage is/was one of partially overcoming these constraints in the course of developing lexical openness. Note, though, the difference in causes of the second stage in the two instances. While infants have to learn the particular manifestation of lexical openness in their community, earlier hominids had to invent it.
Neurological aspects of F/C theory

The neurological component of F/C theory is consistent with the fact that speech production in modern adults involves the joint role of two action sub-systems in higher primates (e.g., Goldberg, 1985) — an "Extrinsic" system and an "Intrinsic" system (MacNeilage, 1998). The "Extrinsic" system is susceptible to external control. It involves much of the posterior cortex and ventral premotor cortex (VPM), including Broca's Area. This system includes the "mirror neurons" discovered by Rizzolatti and colleagues (Rizzolatti et al., 1996; Rizzolatti, Fogassi, & Gallese, 2001), which discharge when an individual performs an action (manual or vocal) and when the individual sees/hears another individual perform the same action (hence Extrinsic control). This system is considered to be relevant to the learning and online execution of segmental content. The "Intrinsic" system is for self-generated action. It includes the Supplementary Motor Area (SMA). This system is considered to be responsible for frame generation in spontaneous speech. Evidence for this proposal includes not only studies of electrical stimulation and irritative lesions of the SMA but also the actions of a subclass of global aphasics (including Broca's original patient "Tan") who are apparently subject to a "release" phenomenon affecting the SMA (see MacNeilage & Davis, 2001, for a review). In all three instances, patients involuntarily generate repetitive strings of CV syllables (e.g., "babababa"), dubbed "Non-Meaningful Recurrent Utterances" by neuropsychologists (e.g., Code, 1994). An important fact underlying our perspective is that the phylogenetic precursor to part of Broca's area (area 44) and the cortex immediately posterior to it (area 6) is the main cortical site for the control of ingestive cyclicities (chewing, sucking, licking) in mammals. According to F/C theory, frames initially derived from the ingestive mandibular cyclicities controlled in VPM (MacNeilage, 1998). But the evidence cited above suggests that frame control for spontaneous communicative acts eventually shifted to the Intrinsic system, particularly the SMA.
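The division of labor just described can be caricatured computationally. The sketch below is our own toy rendering, not a model proposed by the authors: a rhythm generator stands in for Intrinsic frame generation, and a content function stands in for Extrinsic segmental programming; fixing the content yields the pure Frame stage ("bababa"), while programming it per syllable yields the Frame/Content stage ("bodega"):

    # Toy rendering of the two-system division of labor.
    def frame_cycle(n_syllables):
        # Yield alternating close (C) and open (V) phases of the mandibular cycle.
        for _ in range(n_syllables):
            yield "C"
            yield "V"

    def speak(n_syllables, content):
        # Fill each frame slot from a content source; a fixed source gives the
        # pure Frame stage, a varying source gives the Frame/Content stage.
        return "".join(content(i, slot)
                       for i, slot in enumerate(frame_cycle(n_syllables)))

    # Frame stage: the same segments recur on every cycle -> "bababa"
    print(speak(3, lambda i, slot: "b" if slot == "C" else "a"))

    # Frame/Content stage: segments are programmed per syllable -> "bodega"
    segments = [("b", "o"), ("d", "e"), ("g", "a")]
    print(speak(3, lambda i, slot: segments[i // 2][0 if slot == "C" else 1]))

The point of the caricature is only that the same frame cycle can serve both stages; what changes is whether the content slots are independently programmable.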
Since the work of Penfield (e.g., Penfield & Roberts, 1959), the SMA has been known to play a role in language production. Its participation in speech production has been repeatedly confirmed by brain imaging (see Indefrey & Levelt, 2000, for a meta-analysis of imaging studies of speech production). Yet there has been a reluctance to depart from a focus on lateral frontal cortex as the main motor cortical locus for speech output organization. See, for example, the contentions of Abbs & Paul (1998), Lund (1998), and Jürgens (1998) that lateral cortex is the only cortical seat of speech output control. This reluctance, shared by Arbib, has been exacerbated by the momentous discovery of mirror neurons in lateral cortex. But it comes at the cost of neglecting the evolution of the intrinsic control of speech.
Arbib's gestural-origins alternative

We see gestural scenarios as having one particularly strong point. They offer a natural way for meanings to become linked with signals, namely, via iconicity. Signs can look like the objects, actions, and attributes that they indicate. And it is intuitively plausible that earlier hominids might have spontaneously used such signs to communicate. For Arbib (see this volume), the evolution of iconic communication in the form of "pantomime" was a key event in the progression towards language. He sees it as a "precursor" or "scaffolding" for his proto-sign stage. The proto-sign stage requires the further steps of conventionalizing iconic pantomime signals and adding further conventional signals for meanings that don't lend themselves to iconicity. While we agree that the iconic property of pantomime has a natural advantage in allowing a recipient to link a signal with a meaning, we don't believe that this role necessarily made pantomime a linguistic precursor to the spoken medium of language. Instead, we favor Goldin-Meadow & McNeill's (1999) conception that the communicative value of iconic gestures resulted in their retention in modern communicators as a supplement to spoken language. One main goal of the F/C theory is to explicate ways in which lexical openness was enabled by the evolution of combinatorial phonology. In his treatment of the openness issue, Arbib suggests that the protosign stage is open (personal communication), quoting with approval Stokoe's (2001) assertion that "the power of pantomime … provides open-ended [italics ours] communication that works without prior instruction or convention." However, pantomime is not based on combinations of a limited number of meaningless gestural subcomponents — it does not have a phonology. Thus, Arbib makes an assumption contrary to that of Studdert-Kennedy & Lane, and many others, that openness was acquired without
this combinatorial capacity. If pantomime already gave us a capacity to expand the number of words indefinitely, Arbib needs to include an explanation of why we have in modern sign languages a different kind of openness, one which is achieved by combinations of three meaningless parameters of location, handshape, and movement. There is a stark contrast between our proposal and Arbib's in that ours involves a systematic attempt to explain the origins of the present-day structure of the vocal medium. His does not extend to an explanation of the present structure of sign language. This lack of continuity in Arbib's proposal is problematic because the present-day linguistic structure of the two media constitutes the most solid evidence we have about their evolution. This is underscored by the emphasis that is currently placed on reverse engineering in the understanding of the evolution of complex design (e.g., Pinker, 1997). Arbib's position on the evolution of openness leaves him with two extremely serious problems. Hockett (1958) regarded openness as an essential part of the definition of true language, and suggested that if a gestural mode of true language had been accompanied by openness, it would have been a sufficient achievement to ensure that signed language would still be the language of choice. But sign language is not the language of choice today in any extended culture. The kinds of factors cited as leading to the abandonment of an earlier gestural language — e.g., that it is not omni-directional, does not work in the dark, and does not leave the hands free — are altogether too puny to have led to the demise of such a powerful adaptation. Arbib's gestural perspective shares the problem of linking sounds with meanings — the problem that we have begun to address. It can be called the Intermodal Translation Problem. His version of the problem is the need to explain how early gesture-meaning pairs became linked with particular sound patterns in the course of a switch to a vocal-auditory from a manual-visual system. The relation between sound patterns and words of modern languages is predominantly arbitrary. This arbitrariness is as big a problem for a conception of sign-to-speech translation as it is for a vocal-origins scenario. Gordon Hewes (1996), the main proponent of a gestural-origins theory in the 20th century, has confessed, "The ideas about the movement from a postulated pre-speech language to a rudimentary spoken one are admittedly the weakest part of my model" (1996, p. 589). Arbib's "spiral" metaphor for the relation between proto-sign and proto-speech may be verbally appealing, but it is supported only by sparse anecdotal examples. Arbib analyzes only two English sentences with numerous complex syllable structures to refute our well-documented claim that the CV form is the canonical syllable type in languages. However, the syllabic organization of English is exceptional. Maddieson's (1999) survey of 30 diverse languages showed that 80% have
only CV forms or include minor complex syllable structure variants. For the other 20% of languages, we assert that the distinction between the Frame and Frame/Content stages lays the groundwork for the understanding of more complex syllable types. For example, CV co-occurrence constraints are present in the VC sequence of CVC syllables but not in VC sequences when a syllable boundary lies between the V and the C. This patterning tells us that biomechanical constraints are implicated in the evolution of CVC syllables. This type of biomechanical principle for complex action sequences in the vocal domain produces predictions regarding the structure of CVC syllables in the world's languages. In addition, English-speaking infants' early consonant clusters show a biomechanical constraint against place-of-articulation change similar to the constraint against place change in simple CV syllables. This consistent constraint suggests that in the relatively few cases in which consonant clusters evolved in languages, they were initially subject to the same constraint (Jakielski, 1996; Jakielski, Davis, & MacNeilage, in preparation). Consequently, we predict that the smaller the number of consonant clusters in a language, the more the individual members of those clusters will share place of articulation. Arbib's second criticism is that "the mandibular cycle is too far back to serve as an interesting way station on the path towards syllabic vocalization." He asserts that it is "no more illuminating to root syllable production in mandibular oscillation than to root dexterity in swimming." It is as if descent with modification were an acceptable tenet unless it is based on something that happened too long ago! Does this mean that Arbib would be equally refractory to the claim of Cohen (1988), relevant to his manual domain, that a central pattern generator underlying the flexion-extension cycle of locomotion in the limbs of living mammals, including our own forelimbs (e.g., as in hammering), could have originated in fish, half a billion years ago? Arbib does not acknowledge in his critique that a single biphasic rhythm instantiated by the mandible is totally basic to speech. We would assert that when and how manual as well as vocal precursors to speech originated is important, regardless of time frame. There is no ubiquitous rhythm generator in manual dexterity or in modern sign language corresponding to the frame, so obviously the phylogeny of biphasic extension-flexion pattern generators underlying swimming is not relevant. If either manual actions in general or the manual gestures of sign language were accompanied by a single ubiquitous biphasic flexion-extension cycle similar to the mandibular cycle, then lamprey locomotion (circa half a billion years ago) would be highly relevant to their evolution. But they are not. Arbib does not concede the importance of the mandibular cycle for the entire phylogeny of speech. Instead, he wishes, for some reason, to put emphasis on concatenations of discrete movements while not conceding the possibility that they arose from syllable frames and are still integrally related to them.
At the core of Arbib's gestural-origins scenario is the existence of "mirror neurons" in the F5 region of monkey ventral premotor cortex, which discharge for hand movements and also discharge when the monkey sees a similar hand movement (Arbib & Rizzolatti, 1997; Rizzolatti & Arbib, 1998). Introduced in Stage S2 of his current model, they provide for Arbib an initial basis for "The 'parity' requirement for language in humans — that what counts for the speaker must count approximately the same for the hearer" (italics his). However, it is also known that there is a region of monkey F5, lateral to the area containing most hand-related neurons, in which neurons are associated with mouth movements (Gentilucci et al., 1988). Rizzolatti and colleagues have now shown that there are mirror neurons in this region involved in both ingestive behaviors and visuofacial communicative behaviors, including lipsmacks (Ferrari et al., 2003). Importantly, from our standpoint, these neurons have a role in both ingestion and visuofacial communication, consistent with our claim that ingestive actions could have been precursor behaviors for visuofacial communicative activities. Eleven of 12 "communicative mirror neurons" — neurons which responded to species-specific communicative gestures made by the experimenters — also discharged during the making of ingestive actions by the monkey. The neurons were perceptually sensitive to communicative gestures of lipsmacking, teeth chatter, lip protrusion, tongue protrusion, and lip and tongue protrusion together. Associated ingestive actions were sucking, grasping with lips, chewing, reaching with tongue, grasping (with mouth), and grasping with lips. (The 12th neuron, which responded to the communicative gesture of lip protrusion, was active during the animal's production of another communicative gesture — lipsmacking.) The authors state: "In general there was a good correlation between the motor features of the effective observed (communicative) and those of the effective executed (ingestive) action" (Ferrari et al., 2003, p. 1709). They consider these findings to be consistent with the contention from F/C theory that ingestive actions may have formed a basis for the subsequent oral communicative role of this region. In their words, "Ingestive actions are the basis on which communication is built" (p. 1713). Arbib makes several attempts to minimize the significance of these findings for our thesis that language evolved in the vocal mode rather than in a gestural-to-vocal sequence. This is difficult, as the communicative mirror neurons are integral to his stage of transition from the S5 proto-sign stage to the S6 proto-speech stage. In an attempt to counter the vocal-origins scenario, he emphasizes one difference between the manual and the oral mirror neurons. He points out that while manual mirror neurons discharge during execution and observation of the same function (e.g., precision grasping), for almost all the communicative mirror neurons "the observation and execution functions of these neurons are not strictly congruent
[italics ours] — the neurons are active for execution of ingestive actions, e.g., one 'observed' lip protrusion but 'executed' syringe sucking." We would emphasize that in one of the two neurons for which the observation function was lipsmacking, the execution functions were sucking and lipsmacking. For the other neuron there was strict congruence — lipsmacking in, lipsmacking out. Thus for both of the neurons most crucial for our thesis that lipsmacks may have been precursors to proto-syllables, input and output functions were both communicative. We would also stress Ferrari et al.'s conclusion that a methodological problem lay behind the fact that only two of the 12 mirror neurons with communicative observation functions had communicative execution functions. The problem was the difficulty of evoking oral communicative gestures in the experimental situation. Despite this problem, they concluded that "we are inclined to think that the property to discharge during active communicative actions is not limited to the two abovementioned neurons but is rather a property of many, if not all, of them" (p. 1713). Another of Arbib's caveats regarding the significance of communicative mirror neurons for a vocal-origins scenario is based on the fact that these neurons have not been demonstrated to have sensitivity to auditory input and are therefore "a long way from the sort of vocalizations that occur in speech" (Arbib, this volume). Arbib sees the evolution of auditory communicative functions in F5 as being more related to another class of mirror neurons, identified by Rizzolatti's group. Kohler et al. (2002) have found neurons that respond to the sound of an action such as breaking peanuts or ripping paper, even when the sound is presented separately, as well as to the action itself. Arbib concludes from the existence of these audiovisuomotor mirror neurons that "the manual gesture is primary in the early stages of evolution of language readiness, with audio-motor neurons laying the basis for later extension of proto-sign to proto-speech" (Arbib, this volume). But if, as we believe, phonation was eventually added to lipsmacks to form proto-syllables, there is no reason why mirror neurons would not become sensitive to this new acoustical accompaniment of oral communication in a direct way. In fact, given that some visuofacial communicative cyclicities (smacks, teeth chatters) already have oral acoustic correlates, and are sometimes even accompanied by vocalization (MacNeilage, 1998), there is no reason to believe that present-day monkeys do not already have communicative mirror neurons that are acoustically sensitive. In fact, Fogassi and Ferrari (this volume) describe one such neuron! Finally, we wish to make a general point regarding the implications of studies of mirror neurons in monkeys. Given that we probably shared a common ancestor with these monkeys as long ago as 40 million years, the relative frequencies of the various types of mirror neurons identified so far in monkey F5 should probably not be taken as indicating the relative likelihood of gestural versus vocal origins
of human language. We are more comfortable with simply taking all of the mirror neuron variants as existence proofs of a range of early bases for various evolving capacities of the Extrinsic output system. We see two further shortcomings in Arbib's scenario in the context of current thinking on the evolution of action systems. First is an imbalance in the amount of importance ascribed to the Extrinsic and Intrinsic systems. Arbib's conception takes as its point of departure the discovery of mirror neurons in the monkey analog of Broca's Area. This region is generally accepted as being part of an "Extrinsic" system, by which external stimuli (such as the seeing of another animal adopting a precision grip) can influence action. But Arbib fails, in our view, to give sufficient weight to the Intrinsic system, even though he does note the relevance of the SMA and the basal ganglia to his account. The Extrinsic system is important, as it helps us understand how use of language media can be learned. But under natural conditions, actual linguistic utterances are typically self-generated. There is plenty of evidence that electrical stimulation of the Intrinsic system in humans produces not only sequences of CV syllables, but also complex output forms involving the rest of the body, including the manual system. For example, Penfield & Welch (1951) reported finding three types of movements when stimulating the supplementary motor area: "(1) assumption of postures; (2) maneuvers such as stepping; and (3) rapid in-coordinate movements" (p. 310). As an example of the latter, they reported that "One patient (G.L.) showed a sudden rapid flexion of the contra-lateral elbow when a point in the supplementary motor area was stimulated. This response caused him to strike his nose with his hand" (Penfield & Welch, 1951, p. 311). Where, then, does the Intrinsic system fit, if at all, into Arbib's conception? More specifically, what is the parceling of the roles of the two systems in a gestural-origins scenario, a scenario where, unlike the F/C hypotheses, there is no natural behavioral-level division of manual output into frame and content components? Two neurological findings are particularly relevant to this parceling operation. It is well known that a result of damage to the SMA is an inability to spontaneously initiate movements (Rubens, 1975). In addition, Watson et al. (1986) showed that patients with SMA damage were much less able to produce pantomime from verbal instruction than they were to imitate actions or use objects presented to them. Both findings emphasize the importance of the Intrinsic system in the evolution of spontaneous voluntary action in general. There is a second shortcoming of Arbib's scenario, related to the first. Arbib makes surprisingly little use of Merlin Donald's extensive writings on the evolution of a general-purpose mimetic capacity (e.g., Donald, 1994; 1999). For Donald, "mimesis" is the equivalent of Arbib's "pantomime." While Arbib focuses only on
manual pantomime, in the service of his gestural-origins scenario, Donald argues persuasively that a capacity for vocal-auditory mimesis must have evolved simultaneously with the mimetic capacity of the rest of the body before language evolved, because otherwise we would not have been able to evolve the combinatorial structure of spoken language. Donald also argues, contra Arbib, that because mimesis was a general-purpose adaptation, the assumption of a specifically gestural linguistic precursor to spoken language is not a necessary one. Finally, Donald points out that for mimetic capacity to have evolved in hominids, "they had to evolve direct executive governance over action. Attention had to be redirected inwards, away from the external world and toward their own actions" (Donald, 2001, p. 270). This, together with the fact that pantomime is always self-initiated, suggests that much more emphasis should be put on the intrinsic action system in the evolution of use of the medium of language than Arbib has to date. Finally, it is of interest to explicitly compare the neurobiological component of Arbib's conception and the F/C conception. While both approaches accept the proposition that mirror neurons must have played a crucial formative role in the learnability of input-output relations necessary for both gestural and vocal communication in modern hominids, Arbib somewhat arbitrarily restricts the contribution of mirror neurons to vocal communication to a protospeech stage necessarily following a protosign stage. In addition, while the F/C conception makes the Intrinsic component of output control, centering on the medial cortex of the SMA, central to voluntary output in general and to frame generation in particular, Arbib basically confines his model to the Extrinsic system.
Summary

F/C theory centers on the evolution of the lexical openness found in present-day languages. How did we come by the ability to concatenate a limited set of meaningless phonological elements (consonants and vowels) so as to make available an indefinitely large set of words? In our view, true language originated with an open lexical system employing the auditory-vocal medium for transmission. Ours is an ingestive-origins scenario grounded in precursor behaviors found in the mandibular oscillations of putative CV protosyllables (Frames). These cyclicities, originating in ingestive behaviors, were used for visuofacial communication, and were finally paired with phonation to form protosyllables. In the proximal trigger phase, openness in the vocal-auditory mode evolved in two stages. The first was a "Frame" stage, enabling production of relatively uniform series of mandibular oscillations (e.g., "bababa") paired with phonation. The
second phase of refining the language system was a Frame/Content stage, whereby different consonants and vowels were programmed into successive syllable frames (e.g., in "bodega"). We argue that the phylogeny of speech recapitulates its ontogeny in one specific way: both ontogeny and phylogeny necessarily involve an initial (Frame) stage dominated by inertial biomechanical constraints on the oral system, some of which are eventually superseded in a second (Frame/Content) stage, in service of the evolution of independent control of the syllable, providing the capability for the intersyllabic variegation required for truly versatile openness.

We assert that an advantage of vocal-origins over gestural-origins theories of the phylogeny of medium use is that they can begin with the known outcome of language evolution — current speech — and use its linguistic structure, as seen in the course of ontogeny and in the structure of current languages, as a basis for inferring its phylogeny. We contrast our approach with that of Arbib, who takes no position on the contemporary phonological structure of either of these two mediums. Consequently he is not in a position to adequately consider an extremely vexing problem typically neglected in gestural-origins scenarios — the problem of how the structure of the gestural medium was translated into the structure of the vocal medium.

Arbib's theory capitalizes on the plausibility of iconicity as an early basis for the establishment of signal-meaning relationships in the evolution of language. He grounds his claim of an early gesturally based open system in a visual-gestural linkage capacity mediated by mirror neurons, and adds a protospeech stage to the protosign stage of iconic language to evolve a contemporary auditory-vocal language medium. We concur that there is an iconicity advantage of the gestural-visual over the vocal-auditory medium in scenarios for the origin of signal-meaning relationships. However, iconicity did not result in a pre-vocal open lexical stage of gestural language but in the non-linguistic use of gestural iconicity as a communicative supplement to language, a use still present today. Our main rationale for suggesting a nonlinguistic track for gestural phylogeny is that if we had ever achieved an open gestural linguistic system, as Arbib claims, the resultant signed language would have been such a momentous adaptation that the visuo-gestural medium would still be the predominant medium today.

The gestural-visual advantage for understanding the origin of signal-meaning relationships (iconicity) has a mirror image in the form of a problem for all language-origins theories: the problem of addressing the origin of sound-meaning pairings in light of the present-day arbitrariness of sound-meaning relationships. We present the hypothesis that the Baby Talk domain has both the ecological structure and, according to F/C theory, the phonetic underpinnings requisite to the advent of the pairing of sounds with meanings to form the first words in the form of parental
terms, terms that still exhibit a phonetic nasal (mother) versus oral (father) contrast across languages, and the fundamental CV co-occurrence patterns. In addition, we point out, with Rizzolatti and his colleagues, that the existence of mirror neurons associated with both ingestion and oral communication (typically both in a single neuron) is consistent with the aspect of F/C theory that deals with precursors to the simple Baby Talk words with which language might have begun.
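To make the combinatorial arithmetic behind the "openness" claim concrete, here is a minimal sketch; the consonant and vowel inventories below are invented for illustration and are not taken from the data discussed above:

```python
from itertools import product

# Toy inventories of meaningless phonological elements (illustrative only).
consonants = ["b", "d", "g", "m", "n"]
vowels = ["a", "i", "u"]

# Frame/Content programming: each CV syllable fills one mandibular frame.
syllables = ["".join(cv) for cv in product(consonants, vowels)]
print(len(syllables))              # 5 consonants * 3 vowels = 15 syllables

# Concatenating frames with independently chosen content explodes the
# space of possible word forms.
for n in (1, 2, 3):
    print(n, len(syllables) ** n)  # 15, 225, 3375 possible n-syllable words
```

Even these toy inventories yield 15 CV syllables and 3,375 distinct three-syllable word forms; the point is that once content can be programmed independently into successive frames, a closed articulatory inventory supports an effectively open lexicon.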
References

Abbs, J. H., & DePaul, R. (1998). Motor cortex fields and speech movements: Simple dual control is implausible. Behavioral and Brain Sciences, 21, 511–512.
Arbib, M. A., & Rizzolatti, G. (1997). Neural expectations: A possible evolutionary path from manual skills to language. Communication and Cognition, 29, 393–424.
Bell, A., & Hooper, J. B. (Eds.) (1978). Syllables and segments. Amsterdam: North Holland.
Code, C. (1994). Speech automatism production in aphasia. Journal of Neurolinguistics, 8, 149–156.
Cohen, A. H. (1988). Evolution of the central pattern generator for locomotion. In A. Cohen, S. Rossignol, & S. Grillner (Eds.), Neural control of rhythmic movements. New York: Wiley.
Darwin, C. (1859). The origin of species. London: John Murray.
Davis, B. L., & MacNeilage, P. F. (1995). The articulatory basis of babbling. Journal of Speech and Hearing Research, 38, 1199–1211.
Davis, B. L., MacNeilage, P. F., & Matyear, C. L. (2002). Acquisition of serial complexity in speech production: A comparison of phonetic and phonological approaches. Phonetica, 59, 75–107.
Donald, M. (1994). Origin of the modern mind. Cambridge, MA: Harvard University Press.
Donald, M. (2001). A mind so rare: The evolution of human consciousness. New York: Norton.
Dunbar, R. I. M. (1996). Grooming, gossip and the evolution of language. London: Faber and Faber.
Falk, D. (2004). Prelinguistic evolution in early hominins: Whence motherese? Behavioral and Brain Sciences, 27, 491–503.
Ferguson, C. A. (1964). Baby talk in six languages. American Anthropologist, 66, 103–114.
Ferrari, P. F., Gallese, V., Rizzolatti, G., & Fogassi, L. (2003). Mirror neurons responding to the observation of ingestive and communicative mouth actions in the monkey ventral premotor cortex. European Journal of Neuroscience, 17, 1703–1714.
Gentilucci, M., Fogassi, L., Luppino, G., Matelli, M., & Rizzolatti, G. (1988). Functional organization of inferior area 6 in the macaque monkey I: Somatotopy and the control of proximal movements. Experimental Brain Research, 71, 475–490.
Gildersleeve-Neumann, C., Davis, B. L., & MacNeilage, P. F. (2000). Contingencies governing the acquisition of fricatives, affricates and liquids. Applied Psycholinguistics, 21, 341–363.
Goldberg, G. (1985). Supplementary motor area structure and function: Review and hypothesis. Behavioral and Brain Sciences, 8, 567–616.
Goldin-Meadow, S., & McNeill, D. (1999). The role of gesture and mimetic representation in making language the province of speech. In M. C. Corballis & S. E. G. Lea (Eds.), The descent of mind (pp. 155–172). Oxford: Oxford University Press.
Goldman, H. I. (2001). Parental reports of "MAMA" sounds in infants: An exploratory study. Journal of Child Language, 28, 497–506.
Gould, S. J. (1977). Ontogeny and phylogeny. Cambridge, MA: Belknap.
Hertwig, O. (1901). Einleitung und allgemeine Literaturübersicht. In O. Hertwig (Ed.), Handbuch der vergleichenden und experimentellen Entwickelungslehre der Wirbeltiere, Vol. 1, part 1 (pp. 1–85). Jena: Gustav Fischer.
Hewes, G. (1996). A history of the study of language origins and the gestural primacy hypothesis. In A. Lock & C. R. Peters (Eds.), Handbook of human symbolic evolution (pp. 571–595). Oxford: Clarendon Press.
Hockett, C. F. (1978). In search of Jove's brow. American Speech, 53, 243–319.
Indefrey, P., & Levelt, W. J. M. (2000). The neural correlates of language production. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences, 2nd Edition (pp. 845–866). Cambridge, MA: Bradford.
Jackendoff, R. (2003). Foundations of language: Brain, meaning, grammar, evolution. Oxford: Oxford University Press.
Jacob, F. (1977). Evolution and tinkering. Science, 196, 1161–1166.
Jakielski, K. J. (1996). The acquisition of consonant clusters. Unpublished Ph.D. dissertation, University of Texas at Austin.
Jakielski, K. J., Davis, B. L., & MacNeilage, P. F. (in preparation). Place-change restriction in early consonant clusters.
Jakobson, R. (1960). Why "Mama" and "Papa"? In B. Kaplan & S. Wapner (Eds.), Essays in honor of Heinz Werner (pp. 124–134). New York: International Universities Press.
Jürgens, U. (1998). Speech evolved from vocalization, not mastication. Behavioral and Brain Sciences, 21, 519–520.
Kenstowicz, M. (1994). Phonology in generative grammar. Cambridge, MA: MIT Press.
Kohler, E., Keysers, C., Umiltà, M. A., Fogassi, L., Gallese, V., & Rizzolatti, G. (2002). Hearing sounds, understanding actions: Action representation in mirror neurons. Science, 297, 846–848.
Levelt, W. J. M. (1992). Accessing words in speech production: Stages, processes and representations. Cognition, 42, 1–22.
Levelt, W. J. M. (1999). Producing spoken language: A blueprint of the speaker. In C. M. Brown & P. Hagoort (Eds.), The neurocognition of language (pp. 83–122). Oxford: Oxford University Press.
Lieberman, P. (1984). The biology and evolution of language. Cambridge, MA: Harvard University Press.
Lock, A., & Peters, C. R. (Eds.) (1996). Handbook of human symbolic evolution. Oxford: Clarendon Press.
Lund, J. P. (1998). Is speech just chewing the fat? Behavioral and Brain Sciences, 21, 522.
MacNeilage, P. F. (1998). The frame/content theory of evolution of speech production. Behavioral and Brain Sciences, 21, 499–546.
MacNeilage, P. F. (2000). The explanation of "mama". Behavioral and Brain Sciences, 23, 440–441.
MacNeilage, P. F., & Davis, B. L. (1990). Acquisition of speech: Frames, then content. In M. Jeannerod (Ed.), Attention and performance XIII (pp. 453–476). Hillsdale, NJ: Erlbaum.
MacNeilage, P. F., & Davis, B. L. (1993). Motor explanations of babbling and early speech patterns. In B. de Boysson-Bardies, S. de Schonen, P. Jusczyk, & P. F. MacNeilage (Eds.), Developmental neurocognition: Speech and face processing in the first year of life (pp. 341–352). Dordrecht: Kluwer.
MacNeilage, P. F., & Davis, B. L. (2000). Origin of the internal structure of word forms. Science, 288, 527–531.
MacNeilage, P. F., & Davis, B. L. (2001). Motor mechanisms in speech ontogeny: Phylogenetic, neurobiological and linguistic implications. Current Opinion in Neurobiology, 11, 696–700.
MacNeilage, P. F., & Davis, B. L. (2004a). Baby talk and the origin of the word. Paper presented at the Fifth International Conference on the Evolution of Language, Leipzig, Germany, April, 2004.
MacNeilage, P. F., & Davis, B. L. (2004b). Baby talk and the emergence of first words. Behavioral and Brain Sciences, 27, 517–518.
MacNeilage, P. F., Davis, B. L., Kinney, A., & Matyear, C. L. (2000a). The motor core of speech: A comparison of serial organization patterns in infants and languages. Child Development, 71, 153–163.
MacNeilage, P. F., Davis, B. L., Kinney, A., & Matyear, C. L. (2000b). Origin of serial output complexity in speech. Psychological Science, 10, 459–460.
Maddieson, I. (1999). In search of universals. Proceedings of the 14th International Congress of Phonetic Sciences, Vol. 3, San Francisco, California, August, 1999, 2521–2528.
Matyear, C. L., MacNeilage, P. F., & Davis, B. L. (1998). Nasalization of vowels in nasal environments in babbling: Evidence for frame dominance. Phonetica, 55, 1–17.
McShane, J. (1979). The development of naming. Linguistics, 17, 879–905.
Medicus, G. (1992). The inapplicability of the biogenetic law to behavioral development. Human Development, 35, 1–8.
Murdock, G. P. (1959). Cross-language parallels in parental kin terms. Anthropological Linguistics, 1, 1–5.
Penfield, W., & Roberts, L. (1959). Speech and brain mechanisms. Princeton: Princeton University Press.
Penfield, W., & Welch, K. (1951). The supplementary motor area of the cerebral cortex: A clinical and experimental study. AMA Archives of Neurology and Psychiatry, 66, 289–317.
Pinker, S. (1997). How the mind works. New York: Norton.
Radinsky, L. B. (1987). The evolution of vertebrate design. Chicago: University of Chicago Press.
Redford, M. A., MacNeilage, P. F., & Davis, B. L. (1997). Production constraints on utterance-final consonant characteristics in babbling. Phonetica, 54, 172–186.
Redican, W. K. (1975). Facial expressions in nonhuman primates. In L. A. Rosenblum (Ed.), Primate behavior: Developments in field and laboratory research, Vol. 4 (pp. 103–194). New York: Academic Press.
Rizzolatti, G., & Arbib, M. A. (1998). Language within our grasp. Trends in Neurosciences, 21, 188–194.
Rizzolatti, G., Fadiga, L., Gallese, V., & Fogassi, L. (1996). Premotor cortex and the recognition of motor actions. Cognitive Brain Research, 3, 131–141.
Rizzolatti, G., Fogassi, L., & Gallese, V. (2001). Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews Neuroscience, 2, 661–670.
Rousset, I. (2003). From lexical to syllabic organization: Favored and disfavored co-occurrences. Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, August, 2003, 2705–2708.
Rubens, A. B. (1975). Aphasia with infarction of the territory of the anterior cerebral artery. Cortex, 11, 239–250.
Stokoe, W. C. (2001). Language in hand: Why sign came before speech. Washington, DC: Gallaudet University Press.
Studdert-Kennedy, M. G., & Lane, H. (1980). Clues from the difference between signed and spoken languages. In U. Bellugi & M. G. Studdert-Kennedy (Eds.), Biological constraints on linguistic form (pp. 29–40). Berlin: Verlag Chemie.
Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Cambridge: Harvard University Press.
Van Hooff, J. A. R. A. M. (1962). Facial expressions in higher primates. Symposia of the Zoological Society of London, 8, 97–125.
Van Hooff, J. A. R. A. M. (1967). Facial displays of the catarrhine monkeys and apes. In D. Morris (Ed.), Primate ethology (pp. 7–68). London: Weidenfeld and Nicolson.
Watson, R. T., Fleet, W. S., Gonzales-Rothi, L., & Heilman, K. M. (1986). Apraxia and the supplementary motor area. Archives of Neurology, 43, 787–792.
Wray, A. (Ed.) (2003). The transition to language. Cambridge: Cambridge University Press.
About the authors

Peter MacNeilage received his Ph.D. in psychology from McGill University in 1962; he has been a research associate at Haskins Laboratories, and has taught at Barnard College, the University of California at Berkeley, and (presently) the University of Texas at Austin. He has written over 100 scientific papers, more recently on the evolution of complex action systems, manual and vocal. He is a fellow of the American Association for the Advancement of Science, the Acoustical Society of America, the International Neuropsychological Symposium, and the Center for Advanced Studies in the Social and Behavioral Sciences.

Barbara Davis is a professor in the Department of Communication Sciences and Disorders at The University of Texas at Austin. She has published over 70 articles and book chapters and is the winner of a variety of research and teaching awards. Davis's research is focused on the interactive influences of production system capacities and perceptual influences in the acquisition of speech production skill. She has participated in studies comparing infant production data with vocal tract growth simulations. She has also considered ways in which data on serial patterns in the earliest phases of infant speech acquisition may properly inform understanding of the evolution of the speech production capacity in ancestral speakers. Her research includes studies of typically developing infants in English and diverse language environments, and infants receiving early cochlear implants, as well as comparative studies of serial patterning in languages.
Intentional communication and the anterior cingulate cortex

Oana Benga
Department of Psychology, Babeş-Bolyai University, Cluj-Napoca
This paper presents arguments for considering the anterior cingulate cortex (ACC) as a critical structure in intentional communication. Different facets of intentionality are discussed in relationship to this neural structure. The macrostructural and microstructural characteristics of the ACC are proposed to sustain the uniqueness of its architecture, as an overlap region of cognitive, affective and motor components. At the functional level, roles played by this region in communication include social bonding in mammals, control of vocalization in humans, semantic and syntactic processing, and initiation of speech. The involvement of the anterior cingulate cortex in social cognition is also suggested: in infants, joint attention skills are considered both prerequisites of social cognition and prelinguistic communication acts. Since the intentional dimension of gestural communication seems to be connected to a region previously equipped for vocalization, the ACC might well be a starting point for linguistic communication.

Keywords: anterior cingulate cortex, intentionality, communication, social cognition
Alternative scenarios have been proposed for the evolution of the primate brain that would have enabled human cognitive functions like language. A "bottom-up" version assumes that pressures of new physically demanding environments promoted changes in sensory and motor peripheral systems and, consequently, in those cortical systems devoted to them. However, this version has been challenged by "top-down" proposals, like that of Rapoport (1999), which take into account the considerable extent of association neocortex and related subcortical areas in the human brain. Rapoport's model assumes that higher-order processes like attention, memory, symbol manipulation, planning and self-awareness might have themselves promoted changes in the brain. Such processes were initially triggered
by an environment that was challenging not physically but cognitively and socially. Within-brain activation responses to those stresses might have shaped immature brain networks by synaptic consolidation of association circuitry; as a result, the genes of successful adults were spread within populations. Naturally, such brains developed in turn new environments rich in cognitive, social and cultural demands, accelerating further brain evolution. In this equation, one critical component seems to be attention, due to its ability to enhance the activation of large brain networks. A quite similar approach has been proposed by population geneticists in "gene-culture coevolution" or "dual-inheritance" theories (Feldman & Laland, 1996; Laland & Brown, 2002). The plasticity of human neural structures that allows a large epigenetic development seems to be a strong argument in favor of such theories.

However, a critical question that must be solved even beforehand concerns the very beginning of cognitive systems. Whereas reactive systems are stimulus-driven, the motor output being simply a consequence of the sensory input, the essential feature of cognitive systems is an internal world model that allows planning reactions ahead. One way of thinking, maybe the most straightforward, is to assume the addition of new structures that can perform cognitive functions to the previous non-cognitive ones. The other way is to invoke one and the same structure as being able to perform both functions; simply the disconnection from the motor output and parts of the sensory input is sufficient for the "production" of cognitive functions. Evolution of cognition is thus a form of "exaptation" (Byrne, 2000). Beyond the theoretical model, the assumptions of such evolutionary grounds have already been tested successfully by Cruse (2002) in a neural network of the MMC (mean of multiple computation) type. "Top-down evolution" and the emergence of cognitive systems from non-cognitive ones are general principles of a possible framework that will be further applied in this paper.
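As a purely illustrative aside, the "mean of multiple computations" idea can be sketched in a few lines of code. The two-segment arm geometry, the variable names and the damping scheme below are assumptions chosen for the example, not details taken from Cruse (2002):

```python
import numpy as np

def mmc_step(L1, L2, R, damping=2.0):
    """One relaxation step: each variable is recomputed as the mean of
    every equation that defines it (all derived here from R = L1 + L2),
    plus `damping` copies of its own previous value."""
    L1_new = (R - L2 + damping * L1) / (1.0 + damping)
    L2_new = (R - L1 + damping * L2) / (1.0 + damping)
    R_new = (L1 + L2 + damping * R) / (1.0 + damping)
    return L1_new, L2_new, R_new

# Two planar arm segments L1, L2 and the endpoint vector R = L1 + L2.
L1 = np.array([1.0, 0.0])
L2 = np.array([0.0, 1.0])
R = L1 + L2

target = np.array([1.5, 0.5])        # desired endpoint
for _ in range(50):
    R = target                       # clamp the "sensory" input variable...
    L1, L2, R = mmc_step(L1, L2, R)  # ...and let the rest of the model relax

print(L1 + L2)                       # the segments now sum to the target
```

The same network behaves reactively when its endpoint variable is clamped by sensory input, but with that input disconnected a value can be set internally and the relaxation amounts to simulating the movement, i.e., a primitive internal world model.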
The main proposal here is to reconsider a controversial brain structure, the anterior cingulate cortex, and its role in intentional communication.

Before going further, one point that deserves clarification is the very notion of intentionality in communication, to which several meanings have been assigned. In a very philosophical sense, it can mean "aboutness", a mental state that is directed at, or about, certain objects and states of affairs in the world — an idea introduced in psychology by Dennett (1987) but having a previous history in the work of philosophers like Brentano and, more recently, John Searle. Intentionality is also a term used to describe a purposeful, goal-directed representation or a teleological understanding of the world; meanwhile, much of the literature on social cognition links intentionality with referential understanding (Csibra, 2003), which involves the ascription of attentional states, referential intents and communicative messages, taking into account what other individuals believe or want. In a narrower sense, intentionality involves a controlled, voluntary, "willed" act of action or communication.

The intermingled uses of the concept are rather obvious. For example, in a review on the evolution of language, Hauser, Chomsky and Fitch (2002) propose that a dimension of what they call "the faculty of language in the broad sense", or FLB, is the conceptual-intentional one, empirically tested via referential vocal signals, voluntary control over signals, rational (goal-directed) imitation and theory of mind/attribution of mental states. Human language is all of these, but it would be more helpful to use a finer-grained analysis at multiple levels/conceptualizations of intentionality, some of them including others. However, in the present analysis, because of the blurry conceptual boundaries that are invoked in the empirical studies, the same neural structure will be connected to different understandings of intentionality.

The present focus on intentional communication is not meant to reduce language to this one dimension. Other features are essential for understanding human language properly: for example, the recursive/syntactic component, considered by some, but not all (Jackendoff, 1999), to be the core of language in a narrow sense (Hauser, Chomsky & Fitch, 2002), or the sensory-motor dimension and its corresponding computational mechanisms (like the "mirror system", imitation, protosign and protospeech systems; Arbib, 2005). Yet the communicative component, although still overlooked, can offer significant insights, some of which will be proposed here.
A second-order neural structure?

An intriguing thesis of Corballis (2003), influenced by new approaches committed to the "mirror neurons" hypothesis (Rizzolatti & Arbib, 1998), assumes that language evolved originally from gestures, only later incorporating vocal elements. Therefore it is said that language emerged from gestural communication and not from primate calls. On this account, vocalization in primates is connected to the anterior cingulate cortex, but this means a total lack of cortical control (Hauser, 1996) or voluntary control, with even the variations in vocalization reflecting only emotional arousal (Tomasello & Call, 1997) — in other words, no semantic or syntactic content. On the other hand, because gestural communication has intentional content, it is suggested that the proper beginning of vocal speech involved a shift in the mechanism of control. In order to include vocal elements in intentional communication, the shift was made from the anterior cingulate cortex (linked to, if not considered
part of, subcortical structures) to Broca's area. A suggestive metaphor used by Corballis considers the cortical mirror-neuron system "the piano player", while the anterior cingulate/subcortical vocal system is "the piano".

Some aspects of this argument are problematic. First, if intentional means voluntary control, the anterior cingulate cortex is considered "not cortical enough", hence the need for a hierarchical subordination to Broca's area in order to perform higher-order tasks. Also, the focus on cognitive dimensions as being essentially human, and the minimization of the emotional component in linguistic communication, both as a reason for communication and as a way of expression, is rather misinformed. More nuanced views could suggest the combination of cognitive/semantic and emotional components; more comprehensive models have to be generated in the future.
Human specificity of the limbic structures

In a very rationalist tradition, cognition marks the human species, with anything concerning emotions regarded as "second order", "inferior", "ancient", or anterior in terms of phylogeny. Traditional views usually consider the anterior cingulate cortex as part of the Papez circuit, which, being a phylogenetically ancient thalamolimbic circuit, is less useful in explaining the typical human gain in rationality. A similar view was put forward by MacLean who, in 1949, proposed the hypothesis of the "triune brain". In his opinion brain evolution involves a series of concentric shells around an ancient reptilian core: the paleomammalian cortex, which includes the anterior cingulate cortex, and above it the neomammalian structures or the neocortex (Allman et al., 2001).

Yet, although humans have bigger brains than predicted by the primate trend — neither body weight nor adjusted body weight (oxygen consumption) being able to predict/explain the brain size (Armstrong, 1990) — there are a few critical issues to consider when thinking about brain/behavior relationships in evolutionary terms. According to Armstrong, it is not the overall size but the particular configurations and the size of neural circuits that are responsible for all typically human behavioral outputs. Also, for neuroanatomical reasons, these consequences should not be explained solely in terms of intelligence. In terms of circuitry, nonallometric changes (changes that would not be predicted by scaling the brain to a different size) are not only present in prefrontal regions but also in the limbic system — for example, in the enlarged mediodorsal nucleus of the human thalamus and (in terms of number of neurons) in the anterior principal complex of the thalamus. Regarding this last structure, Armstrong, Hill & Clark (1987) have hypothesized that it correlates with the social organization of human anthropoids. Based on archeological
data, Armstrong and his colleagues claim a history of 2 million years (the age of Homo habilis) for these limbic changes, which are not part of a primate trend. To account for them, it is clear that one cannot explain human evolution just by adding "more rationality". Motivation, attention and the processing of behavioral aspects involved in social organization are "critical thus for the kind of intelligence humans possess".

The aforementioned nuclei are however connected to other areas, suggesting an integration of cognition and emotion. For example, in humans, axons from the anteroinferior part of the anterior principal complex end in anterior cingulate cortex Brodmann area (BA) 24 — a region which will be described later as having a nodal role in cognition-emotion integration — while in non-human primates it seems that fewer axons go to this part and more to posterior cingulate cortex (area 23), which has mainly motor functions. Afferentation also comes from the cingulate cortex and the subiculum. The mediodorsal complex ends in the granular prefrontal cortex and can even degenerate in case of frontal cortex loss. It has reciprocal connections to the frontal eye fields, but also to rostral, orbital and dorsolateral frontal cortex.
Anterior cingulate cortex — A controversial architecture

The lack of consideration regarding the anterior cingulate cortex (ACC) is mostly determined by the heterogeneity of its cytoarchitecture (Zilles, 1990). For a long while it has been viewed as a part of the limbic system, so that BA 24, 25, 32 and 33, which constitute the agranular anterior cingulate cortex, were opposed to granular areas 23 and 31, contained by the posterior cingulate cortex. But while areas 33 and 25 have a poor laminar differentiation and lack the second (outer granular) layer, having a 5-layered cortical structure, areas 24 and 32 are considered to have an isocortical structure that is more closely similar to prefrontal cortex. This is the reason why in many classifications they are regarded as parts of prefrontal cortex. The inner granular layer (layer four) is absent in area 24 and difficult to recognize in area 32. Yet the lack of layer 4 is still an aspect that resembles the motor areas neighbouring the anterior cingulate. So, rather than providing evidence for the primitivity of this region, it is a sign of its affinity with neocortical motor areas. This is why many authors divide prefrontal cortex into a granular part, the lateral prefrontal one — areas 8–12 and 44–47 — and an agranular part, the medial prefrontal cortex, containing areas 24 and 32. Many authors agree that, along with the most dorsal and medial parts of area 9, areas 24 and 32 participate in the organization and recruitment of other cortical areas whenever there is need for action, access to prior instruction or analysis, or a need for sensory discrimination followed by cues for voluntary movement.
Other macrostructural characteristics

The functional complexity attributed to the anterior cingulate cortex is largely due to its heterogeneity, an aspect that undermines possible attempts to create a unitary model of the structure. The overlapping functions of the ACC are generally linked to the existence of at least three major functional subdivisions — affective, cognitive and motor components — with different underlying patterns of connectivity and distinct cytoarchitectonic properties (Devinsky, Morrell & Vogt, 1995; Bush, Luu & Posner, 2000; Paus, 2001; Picard & Strick, 2001; Yücel et al., 2003).

The cingulum is a collar around the corpus callosum, on the medial wall of each cerebral hemisphere. One major distinction within the ACC is the one between the ventral-limbic (cingulate proper) region versus the dorsal-paralimbic (paracingulate) one (see Figure 1). In 30–50%
Figure 1. Functional divisions (cognitive/affective) of the anterior cingulate cortex, based on a meta-analysis of neuroimaging studies showing activations (a) and deactivations (b) during cognitive and emotional tasks, in 2-D coordinates. The cognitive division is activated by Stroop and Stroop-like tasks, divided attention tasks, and complex response selection tasks, but deactivated (showing reduced blood flow or MR signal) by emotional tasks. The affective division is activated by tasks that relate to affective or emotional content, or symptom provocation, and deactivated by cognitive tasks. A same-group direct comparison indicates the activation of the cognitive division during the cognitive Counting Stroop (orange triangle) (Bush et al., 1998) and of the affective division (blue diamond) during the Emotional Counting Stroop (Whalen et al., 1998). Cognitive Counting Stroop activation in the cognitive division is present in matched normal controls (yellow triangle) but absent in subjects with attention-deficit/hyperactivity disorder (Bush et al., 1999). (Reprinted from Trends in Cognitive Sciences, 4(6), Bush, G., Luu, P., & Posner, M. Cognitive and emotional influences in anterior cingulate cortex. 215–222. Copyright ©2000, with permission from Elsevier.)
of individuals these two regions are separated by the paracingulate sulcus (Paus, 2001), the paracingulate forming a separate gyrus on the medial wall surface of the brain. An interesting asymmetry has been found related to the paracingulate sulcus: in the left hemisphere its incidence and volume have been linked to speech/word generation (Crosson et al., 1999), while in the right hemisphere it seems to covary with a temperamental disposition toward fear and anticipatory reward (Pujol et al., 2002).

The ventral anterior cingulate has been acknowledged as the affective/visceral division — rostral areas 24a–c and 32 and ventral areas 25 and 33 — connected to the amygdala, periaqueductal gray, hypothalamus, anterior insula, hippocampus and orbitofrontal cortex. The dorsal ACC is generally considered the cognitive division — areas 24a′–c′ and 32′ — being connected to lateral prefrontal cortex (46/9), parietal cortex (7), and premotor and supplementary motor areas. The ventral affective division is thought to be responsible for evaluating the salience of emotional information and for emotion regulation, while the dorsal cognitive division is responsible for executive functions like conflict and error monitoring, as has been shown by human as well as animal studies (Bush, Luu & Posner, 2000; Johansen, Fields & Manning, 2001; Cardinal et al., 2002; Swick et al., 2002; Hadland et al., 2003; Yücel et al., 2003). The functional overlap within the ACC is thought to distinguish it from other frontal regions, as a pivotal structure for translating intentions into actions (Paus, 2001).
Microstructural characteristics

Nimchinsky et al. (1999) reported the existence, in layer 5b of BA 24, of a population of spindle-shaped neurons, differing from the usual pyramidal neurons, which have an array of basal radiating dendrites, by a specific conformation: a single apical dendrite extending upwards and a single basal dendrite extending downwards (see Figure 2). These neurons are also four times larger than the average pyramidal cell and they have long-distance projections. The concentration of spindle cells is greatest in humans and decreases with increasing taxonomic distance from our species. The age of origin for this neuronal type might be 15 million years ago, considering that spindle neurons are absent in 23 primate species including gibbon, New World monkeys, Old World monkeys and prosimians, and also in 30 mammalian species studied, but they are present in bonobo (clustered), chimpanzee (with a density comparable to humans), gorilla and orangutan. Just as important is the fact that the average volume of the cell bodies is a function of relative brain size (encephalization), which is not the case for other neuronal subtypes like pyramidal neurons in layer 5 or fusiform cells in layer 6. The authors suggest that this is probably related to the size of the axonal arborization.
Figure 2. Spindle cells in layer Vb of the anterior cingulate cortex in human (A), bonobo (B), common chimpanzee (C), gorilla (D), and orangutan (E) (with similar morphology and apparent somatic size). In the anterior cingulate cortex of the white-handed gibbon (F), patas monkey (G), or ring-tailed lemur (H) spindle cells are absent. (Reprinted with permission from Nimchinsky, E. A., Gilissen, E., Allman, J. M., Perl, D. P., Erwin, J. M., & Hof, P. R., 1999. A neuronal morphologic type unique to humans and great apes. PNAS, 96, 5268–5273. Copyright ©1999 National Academy of Sciences, U.S.A.)
The distribution of spindle neurons in layer 5b of areas 24a, 24b (most abundant) and 24c, and less in area 24′, which makes the transition between anterior and posterior cingulate cortex, shows that they do not characterize a somatic motor area. Ontogenetically, post-mortem analyses seem to suggest that in humans spindle cells can be noticed only from 4 months of age, yet Hayashi, Ito and Shimizu (2001) have reported the existence of spindle neurons in the BA 24b of a chimpanzee fetus at embryonic day 224. Allman et al. (2001) have suggested that the presence and some migratory features of these neurons in four- to eight-month-old infants coincide with voluntary motricity, focused attention and emotional expression; we would also add the emergence of emotional and attentional regulation, as proposed by Posner and Rothbart (2000). Their survival is expected to be enhanced or reduced by environmental conditions of enrichment versus stress (60% of these neurons are destroyed in Alzheimer's disease, as reported by Nimchinsky et al., 1995), so they might influence adult competence or dysfunction in emotional self-control and problem-solving ability.
Another unique characteristic of the anterior cingulate cortex has been described in the superficial half of layer 5 of BA 24 and 25, and to a lesser degree 32: a population of pyramidal neurons containing the calcium-binding protein calretinin (Hof et al., 2001). These neurons are also present in the highest number in humans, followed by chimpanzees and gorillas, and are lowest in orangutans, the basic conjecture being their involvement in "vocalization, facial expression or autonomic functions" (p. 142). These specific types of neurons in the ACC are still quite singular candidates for acknowledging the uniqueness of primate evolution (Hacia, 2001).
Anatomical and functional interconnections of the anterior cingulate cortex

Anatomical interconnectivity of the anterior cingulate cortex with other brain structures has been shown even in other mammals: a recent study by Gabbott et al. (2003) demonstrated the presence of feed-forward connections from rat insular cortices to areas 32 and 25, as well as of feed-back connections from these areas back to the insular cortices, defining the integration of viscerosensory with visceromotor and cognitive information. Interestingly, the authors support the heterogeneity of the anterior cingulate structures, ventral structures (areas 25 and 32) being considered mostly of an affective nature, with only area 32 also sharing a cognitive function along with the more dorsal area 24d. However, at least in the rat, the interconnections between these divisions themselves are sparse; therefore information processing is possible in parallel, in interconnection or even in isolation from neighboring regions, depending on the task. According to Barbas (1995, 1997), monkey areas 24, 25 and 32 have more widespread connections relative to eulaminate (6-layer) cortical regions. A robust functional connectivity between the ACC and lateral prefrontal cortex has also been shown in humans, using a variety of techniques (positron emission tomography, transcranial magnetic stimulation) (Koski & Paus, 2000; Paus, 2001).
Anterior cingulate cortex and communication

Maybe the role of the anterior cingulate cortex in communication can be more easily understood if we focus on the "why" question. Why did language evolve at all? What could have been the role of emerging linguistic abilities? One possible answer would be: to provide an improved way for social communication. Although evolution does not always follow logic in terms of efficiency, it is quite reasonable to think that for this "task" of language emergence it was natural to select structures already proven to serve communication purposes. The
anterior cingulate is one good candidate, given its proven role in vocalization. Yet the traditional view, which cuts "intellectual" functions apart from the "emotional" ones, considers that the "emotional cries" of primates are far away from language, although they are the very productions of the anterior cingulate cortex (see Corballis, 2002). We agree with Newman (2003) that such a clear-cut interspecies difference is quite fruitless, because it rests on a disqualification of the anterior cingulate based on its "limbic" nature, ignoring its structural properties, as previously noted, and the large interconnectivity it has with other brain structures.
Social bonding

Nevertheless, models that underline the limbic nature of the anterior cingulate cortex have some good points. For example, "the triune brain" model of MacLean argues that the thalamocingulate division of the limbic system does not have a counterpart outside the mammals, and that it is involved in nursing, mother-infant audiovocal contact (via the infant cry and the species-typical isolation call), and play (MacLean, 1990). Interestingly, cross-species comparisons show that the same region is metabolically activated in rat infants by the perception of emotionally relevant maternal calls (Braun & Poeggel, 2001), and also in human mothers by the perception of infant cries (Lorberbaum et al., 2002). One critical argument for the involvement of the ACC in social bonding is the fact that this area is rich in oxytocin, opiate, and opiate-like receptors (Steketee, 2003; Sim-Selley et al., 2003).
Control of vocalizations

It was already mentioned that the classical view assigns a role to the ACC only in automatic, reflex-like emotional vocalizations. However, Jürgens (2002) has shown that, in squirrel monkeys, the area to which spindle cells are restricted is one of the only cortical areas known to elicit meaningful vocalizations (not just sounds) when stimulated, and that the same area is also responsible for voluntary phonation in the macaque. Destruction of this structure abolishes vocal operant conditioning (but does not affect the reaction to unconditioned stimuli) and abolishes the long-distance calls of socially isolated animals (but still allows the response to contact calls of conspecifics).

Yet the point of view expressed by Jürgens is that vocal control is hierarchically organized across an extensive network, comprising the forebrain for the voluntary control of vocalization. As part of mediofrontal cortex, along with the supplementary and pre-supplementary motor areas, the ACC is involved in the initiation/suppression of vocal utterances (via the periaqueductal gray as an obligatory relay), but not pattern
generation of vocalizations, while the acoustic structure of vocalizations is under the control of motor cortex via pyramidal/corticobulbar as well as extrapyramidal pathways, of which the most important is the connection running from motor cortex through the putamen, substantia nigra and parvocellular reticular formation to the phonatory neurons. Several inputs come to motor cortex in order to perform this task: a cerebellar input via the ventrolateral thalamus for smooth transitions between vocal elements; a proprioceptive input from the phonatory organs via nucleus ventralis posterior medialis thalami, somatosensory cortex and inferior parietal cortex; an input from ventral premotor and prefrontal cortex, including Broca's area, for the motor planning of longer purposeful utterances; and an input from the ventral premotor and pre-supplementary motor areas, giving rise to the motor commands executed by the motor cortex. Within this model, the ACC itself has connections with lateral prefrontal cortex, supplementary motor cortex, Broca's area and motor face cortex. Actually, Simonyan and Jürgens (2002) take into consideration the massive projections between areas 44 and 45 and the ACC, but the role assigned to this structure is, in their opinion, limited in monkeys to vocalization initiation, and in humans to vocal control over emotional intonation, non-vocal emotional tasks or the control of innate motor patterns.
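For orientation only, the hierarchy just described can be flattened into a toy graph. The structure names follow the text, but reducing parallel, reciprocal anatomy to a one-way adjacency map (and the function that walks it) is a deliberate oversimplification introduced here for illustration:

```python
# Toy adjacency map of the vocal-control hierarchy sketched above
# (after Jürgens, 2002); illustrative only, not an anatomical claim.
VOCAL_CONTROL = {
    "anterior cingulate cortex": ["periaqueductal gray"],  # initiation/suppression
    "periaqueductal gray": ["phonatory neurons"],          # obligatory relay
    "motor cortex": ["corticobulbar pathway", "putamen"],  # acoustic patterning
    "putamen": ["substantia nigra"],
    "substantia nigra": ["parvocellular reticular formation"],
    "parvocellular reticular formation": ["phonatory neurons"],
    "corticobulbar pathway": ["phonatory neurons"],
}

def routes(start, goal="phonatory neurons", path=()):
    """Enumerate all downward paths from a structure to the phonatory
    neurons in this toy graph (depth-first)."""
    path = path + (start,)
    if start == goal:
        return [path]
    found = []
    for nxt in VOCAL_CONTROL.get(start, []):
        found += routes(nxt, goal, path)
    return found

for r in routes("motor cortex"):
    print(" -> ".join(r))
```

Printed out, the two routes from motor cortex correspond to the pyramidal/corticobulbar and extrapyramidal pathways of the text, while the ACC reaches the phonatory neurons only through its obligatory periaqueductal relay.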
Neuroimaging studies of linguistic tasks

Vocalizations, even controlled ones, are by no means the same thing as human language in all its complexity. Still, there are some studies suggesting a direct implication of the anterior cingulate cortex in semantic or syntactic tasks. For example, Abdullaev & Posner (1998) showed that generating a use for a word activates the ACC at 150 ms after the stimulus, prior to the activation of Broca's area, which occurs only at 200 ms. Moreover, Dogil et al. (2002) presented evidence of ACC involvement in syntactic processing as well. Such data seem to confirm the prediction made by Nimchinsky et al. (1999), who explicitly stated that spindle neurons are required for higher brain functions like communication and language.
Lesion data and the initiation of speech

A step forward was made by Newman in 2003, based on data reporting cases of mutism in patients following infarcts in the anterior cingulate cortex (as described by Brown in 1979 and 1988). Newman has proposed that ACC might be involved in early mother-infant dialogues, including motherese, but also in early speech development, particularly vowel production. Whether this conjecture is true or false is still a matter of debate.
Still, if we accept Falk's view (2004), motherese could have played a crucial role, protolanguage in early hominins emerging from prelinguistic vocalization similar to contemporary infant-directed speech, and parental prosody being not only an instrument for propagating language, but an important substrate for the natural selection of protolanguage itself. An important point of Falk's regards the multimodal nature of motherese in humans, the ontogenetic and phylogenetic routes evolving towards increasingly intentional and instrumental rather than emotional vocalizations, coupled with manual gestures. Still, in line with my previous arguments, there is a postulated continuity from the vocalizations of ape ancestors to the prelinguistic vocalizations of early hominins, gestural communication being acknowledged by Falk as an important complement to speech-based communication.

Returning to mutism, other authors relate ACC lesions to this range of neuropathology. Damasio and Van Hoesen (1983) reported a case of left anterior cingulate lesion that seemed to affect the need to talk or reply in conversation, as if the patient had "nothing to say". Devinsky, Morrell and Vogt (1995) presented similar results, as did Cohen et al. (1999), who showed that even a 5 mm diameter bilateral lesion of the ACC seemed to reduce spontaneous behavior (verbal utterances, written statements and also constructive actions) one year later; unfortunately it is not mentioned whether the gestural component was also affected. Related to the cases described by Brown, Tucker (2001) has pointed out that aphasics who had only motor deficits seem to have Broca's area affected, while semantic deficits were associated with lesions of the limbic structures. The approach proposed by the author starting from here is that of "embodied meaning", stating that the landmark of human language is neither the production of speech sounds, nor that of grammar, but the fact that "we have anything interesting to say". And this very ability is linked to motivation, emotional evaluation and attentional control. In his model, the cingulate cortex is the node that properly connects limbic structures to cortical ones, in an act of hierarchical integration that is needed for meaning construction.
Anterior cingulate cortex and social cognition

Neuroimaging data

Several studies on mentalizing have shown that the most consistent brain activation is in the anterior cingulate cortex (Frith & Frith, 2001, 2003; Mundy, 2002). For example, Gallagher et al. (2000) have shown, using fMRI, that both verbally and nonverbally presented theory of mind tasks involve bilateral activation of BA 32 and BA 24.
The meta-analysis of Gallagher and Frith (2003), comprising tasks devoted to causality, intentionality and self-perspective, supports the involvement of these areas, along with that of medial frontal regions BA 8 and 9 (area 32 lying in between the two). The meaning of this consistent activation is still not very clear. For Gallagher & Frith (2003) it is a sign that the ACC is specialized in directing attention to mental states. Other authors, like Allman, Hakeem and Watson (2002), consider that the ACC is phylogenetically specialized — along with BA 10 — to convey the motivation to act, initiating adaptive responses that help self-regulation, as the individual matures and gains social insight.
Joint attention and the anterior cingulate cortex

Mundy (2001) suggested that in infants as well as in human adults the most consistent correlates of joint attention skills across studies are activations in dorsal medial frontal cortex and related cingulate areas. The status of joint attention skills is twofold. On one hand, they are considered prerequisites of social cognition or theory of mind. On the other hand, they represent for many authors prelinguistic communication acts.

A common distinction in developmental studies is the one between initiating behavior regulation (IBR), or imperative joint attention, which comprises the acts of requesting an object or action, and initiating joint attention (IJA), or declarative joint attention, which is sharing positive affect or interest in a referent or event. The distinction between these two types of joint attention can be sustained on the basis of the following arguments, as shown by Henderson et al. (2002):

– IJA is more frequently related to later language development (Mundy & Gomes, 1998; Ulvund & Smith, 1996)
– IJA distinguishes groups of children with autism from children with other disabilities (Baron-Cohen et al., 1992; Mundy, 1995)
– IJA requires neural substrates distinct from those involved in IBR.

In this sense, Caplan et al. (1993) showed, in a PET study with epileptic children, that preoperative glucose metabolism in the left frontal regions positively predicted the children's postoperative IJA behaviors. Mundy, Card, and Fox (2000) presented data suggesting that IJA at 14 months, coupled with a complex pattern of EEG activity in the 4–6 Hz band that indicates left medial-frontal EEG activation, predicts IJA at 18 months. The frontal correlates were obtained from electrodes positioned on a point of confluence of BA 8 and 9, which may pick up activation from the anterior cingulate cortex as well. Similar results were obtained by Henderson et al. (2002).
[Figure 3: bar graph; y-axis 0.00–3.50, bars for HIGH AS versus LOW AS groups.]
Figure 3. Mean number of initiating joint attention (IJA) behaviors in high versus low antisaccade performance (AS) subjects (difference significant at p < .05). From Benga, Nakagawa & Sukigara, 2002.
An interpretation that tries to account for all these data (Mundy, 2001) suggests that IJA is a component of an integrative system predisposing the child to find social interaction rewarding and to share positive affect/experiences with others.

Our own research data are also in line with these findings. In a study on 9–11-month-old children, Benga, Nakagawa, and Sukigara (2002) showed that IJA behaviors correlate with antisaccade performance — the ability to suppress automatic saccades (measured using the paradigm proposed by Johnson, 1995). The emergence of IJA abilities can be studied only later in development, but the suppression of automatic saccades is one of the first inhibitory mechanisms already functional at 4–6 months, linked to the activity of the frontal eye fields (BA 8) and also to the anterior cingulate cortex. Benga, Nakagawa, and Sukigara (2002) obtained no correlation between IBR/imperative joint attention and antisaccade (AS) performance, but a significant correlation between IJA/declarative joint attention and AS performance, even more noticeable when considering the second part of the antisaccade task, which is expected to be characterized by an increasing number of correct responses (Johnson, 1995). Children with more than 50% antisaccades in overall responses (high AS) displayed a significantly higher number of IJA behaviors compared to low AS children (Figure 3). Although this was not a longitudinal study, it seemed that antisaccades precede and predict initiating joint attention behaviors.
Conclusions

Combining all the arguments presented so far, the anterior cingulate cortex must be acknowledged as a critical structure involved in intentional communication. Beyond the structural and functional arguments favoring the role played by the ACC mostly in vocal communication, our pilot empirical data indirectly support its involvement in gestural communication, more specifically in the
initiation of joint attention behaviors. Since the intentional dimension of gestural communication seems to be connected to a region previously equipped for vocalization, the ACC might well be a starting point for linguistic communication for the sake of communication. Still, several facets of intentionality have been related to this structure, so the task of future analyses will be to clarify the relationships between them. Also, the arguments sustained here favor the strategic view of the ACC (Posner & Dehaene, 1994; Posner & DiGirolamo, 1998; Posner & Fan, 2002), which suggests its involvement in selection for action, suppression of automatic/routine behaviors, and error correction. However, it is still a matter of debate whether the anterior cingulate cortex is involved in such tasks or only has an evaluative function (Carter et al., 2000; van Veen & Carter, 2002), being used just for error detection and conflict monitoring.
Credits for reprinted Figures

Figure 1
Bush, G., Luu, P., & Posner, M. (2000). Cognitive and emotional influences in anterior cingulate cortex. Trends in Cognitive Sciences, 4(6), 215–222. Reprinted with permission from Elsevier. Copyright © 2000 Elsevier.
Figure 2
Nimchinsky, E. A., Gilissen, E., Allman, J. M., Perl, D. P., Erwin, J. M., & Hof, P. R. (1999). A neuronal morphologic type unique to humans and great apes. PNAS, 96, 5268–5273. Reprinted with permission from the National Academy of Sciences, U.S.A. Copyright © 1999 National Academy of Sciences, U.S.A.
References

Abdullaev, Y. G., & Posner, M. I. (1998). Event-related brain potential imaging of semantic encoding during processing single words. Neuroimage, 7, 1–13.
Allman, J., Hakeem, A., & Watson, K. (2002). Two phylogenetic specializations in the human brain. The Neuroscientist, 8, 335–346.
Allman, J. M., Hakeem, A., Erwin, J. M., Nimchinsky, E., & Hof, P. (2001). The anterior cingulate cortex: The evolution of an interface between emotion and cognition. Annals of the New York Academy of Sciences, 935, 107–117.
Arbib, M. (2005). From monkey-like action recognition to human language: An evolutionary framework for neurolinguistics. Behavioral and Brain Sciences, 28, 105–167.
Armstrong, E. (1990). Evolution of the brain. In G. Paxinos (Ed.), The human nervous system. San Diego: Academic Press.
Armstrong, E., Hill, E. M., & Clark, M. (1987). Relative size of the anterior thalamic nuclei differentiates anthropoids by social organization. Journal of Comparative Neurology, 253, 539–548.
Barbas, H. (1995). Anatomic basis of cognitive-emotional interactions in the primate prefrontal cortex. Neuroscience & Biobehavioral Reviews, 19, 499–510.
Barbas, H. (1997). Two prefrontal limbic systems: Their common and unique features. In H. Sakata, A. Mikami, & J. Fuster (Eds.), The association cortex: Structure and function. Amsterdam: Harwood Academic Publishers.
Baron-Cohen, S., Allen, J., & Gillberg, C. (1992). Can autism be detected at 18 months? The needle, the haystack, and the CHAT. The British Journal of Psychiatry, 161, 839–843.
Benga, O., Nakagawa, A., & Sukigara, M. (2002). Joint attention, attention and neural structures: A possible conjecture. Poster presented at the Euresco Conference "Brain Development and Cognition in Human Infants", Acquafredda di Maratea, Italy.
Braun, K., & Poeggel, G. (2001). Recognition of mother's voice evokes metabolic activation in the medial prefrontal cortex and lateral thalamus of Octodon degus pups. Neuroscience, 103(4), 861–864.
Brown, J. W. (1979). Language representation in the brain. In H. D. Steklis & M. J. Raleigh (Eds.), Neurobiology of social communication in primates. New York: Academic Press.
Brown, J. W. (1988). Cingulate gyrus and supplementary motor correlates of vocalization in man. In J. D. Newman (Ed.), The physiological control of mammalian vocalization. New York: Plenum Press.
Bush, G., Whalen, P. J., Rosen, B. R., Jenike, M. A., McInerney, S. C., & Rauch, S. L. (1998). The Counting Stroop: An interference task specialized for functional neuroimaging — validation study with functional MRI. Human Brain Mapping, 6, 270–282.
Bush, G., Frazier, J. A., Rauch, S. L., Seidman, L. J., Whalen, P. J., Jenike, M. A., Rosen, B. R., & Biederman, J. (1999). Anterior cingulate cortex dysfunction in attention-deficit/hyperactivity disorder revealed by fMRI and the Counting Stroop. Biological Psychiatry, 45, 1542–1552.
Bush, G., Luu, P., & Posner, M. (2000). Cognitive and emotional influences in anterior cingulate cortex. Trends in Cognitive Sciences, 4(6), 215–222.
Byrne, R. W. (2000). Evolution of primate cognition. Cognitive Science, 24, 543–570.
Caplan, R., Chugani, H., Messa, C., Guthrie, D., Sigman, M., Traversay, J., & Mundy, P. (1993). Hemispherectomy for early onset intractable seizures: Presurgical cerebral glucose metabolism and postsurgical nonverbal communication patterns. Developmental Medicine and Child Neurology, 35, 574–581.
Cardinal, R., Parkinson, J. B., Hall, J., & Everitt, B. (2002). Emotion and motivation: The role of the amygdala, ventral striatum, and prefrontal cortex. Neuroscience & Biobehavioral Reviews, 26, 321–352.
Carter, C., Macdonald, M., Ross, L., Stenger, V. A., Noll, D., & Cohen, J. (2000). Parsing executive processes: Strategic vs. evaluative functions of the anterior cingulate cortex. PNAS, 97(4), 1944–1948.
Cohen, J., Botvinick, M., & Carter, C. (2000). Anterior cingulate and prefrontal cortex: Who's in control? Nature Neuroscience, 3(5), 421–423.
Intentional communication and the anterior cingulate cortex
Cohen, R. A., Kaplan, R. F., Zuffante, P., Moser, D. J., Jenkins, M. A., Salloway, S. & Wilkinson, H. (1999). Alteration of intention and self-initiated action associated with bilateral anterior cingulotomy. Journal of Neuropsychiatry & Clinical Neurosciences, 11, 444–453. Corballis, M. C. (2003). From mouth to hand: Gesture, speech and the evolution of right-handedness. Behavioral and Brain Sciences 26(2), 199–208. Crosson, B., Sadek, J. R., Bobholz, J. A., Gökçay, D., Mohr, C. M., Leonard, C. M., Maron, L., Auerbach, E. J., Browd, S. R., Freeman, A. J. & Briggs, R. W. (1999). Activity in the paracingulate and cingulate sulci during word generation: An fMRI Study of Functional Anatomy. Cerebral Cortex, 9: 307–316. Cruse, H. (2002). The evolution of cognition — a hypothesis. Cognitive Science, 102, 1–21. Csibra, G. (2003). Teleological and referential understanding of action in infancy. Philosophical Transactions of the Royal Society, London B, 358, 447–458. Damasio, A. R. & Van Hoesen, G. (1983) Emotional disturbances associated with focal lesions of the limbic frontal lobe. In Heilman, K. M. & Satz, P. (Eds.) Neuropsychology of human emotion. New York: Guilford Press. Dennett, D. C. (1987). The intentional stance. Cambridge, MA: MIT Press. Devinsky, O., Morrell, M. J., & Vogt, B. A. (1995). Contribution of anterior cingulate cortex to behaviour. Brain, 118, 279–306. Dogil, G., Ackermann, H., Grodd, W., Haider, H., Kamp, H., Mayer, J., Riecker, A., & Wildgruber, D. (2002). The speaking brain: A tutorial introduction to fMRI experiments in the production of speech, prosody and syntax. Journal of Neurolinguistics, 15, 59–90. Falk, D. (2004). Prelinguistic evolution in early hominins: Whence motherese? (target article). Behavioral and Brain Sciences, 27, 491–503. Feldman, M. W., & Laland, K. N. (1996). Gene-culture coevolutionary theory. Trends in Ecology and Evolution, 11, 453–457. Frith, U., & Frith, C. D. (2003). Development and neurophysiology of mentalizing. Proceedings of the Royal Society, London B, 358, 459–475. Frith, U., & Frith, C. (2001). The biological basis of social interaction. Current Directions in Psychological Science, 10, 151–155. Gabbott, P. L. A., Warner, T. A., Jays, P. R. L. & Bacon, S. J. (2003). Areal and synaptic interconnectivity of prelimbic (area 32), infralimbic (area 25) and insular cortices in the rat. Brain Research 993(1–2), 59–71. Gallagher, H. L., & Frith, C. D. (2003). Functional imaging of “theory of mind”. Trends in Cognitive Sciences, 7(2), 77–83. Gehring, W., Knight, R. (2000). Prefrontal-cingulate interactions in action monitoring, Nature Neuroscience, 3(5), 516–520. Hacia, J. (2001). Genome of the apes. Trends in Genetics,17, 637–645. Hadlang, K. A., Rushworth, M. F. S., Gaffan, D., Passingham, R. E. (2003). The effect of cingulate lesions on social behaviour and emotion. Neuropsychologia, 41, 919–931. Hauser, M. D. (1996). Vocal communication in macaques: Causes of variation. In J. Fa and D. Lindburg (Eds.), Evolutionary ecology and behavior of macaques. Cambridge: Cambridge University Press. Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298, 1569–1579. Hayashi, M., Ito, M., & Shimizu, K. (2001). The spindle neurons are present in the cingulate cortex of chimpanzee fetus. Neuroscience Letters, 309, 97–100.
175
176
Oana Benga
Henderson, L. M., Yoder P. J., Yale, M. E., & McDuffie, A. (2002). Getting the point: electrophysiological correlates of protodeclarative pointing. International Journal of Developmental Neuroscience, 20, 449–458. Hof, P., Nimchinsky, E., Perl, D., & Erwin, J. M. (2001). An unusual population of pyramidal neurons in the anterior cingulate cortex of hominids contains the calcium-binding protein calretinin. Neuroscience Letters, 307, 139–142. Jackendoff, R. (1999). Possible stages in the evolution of the language capacity. Trends in Cognitive Sciences, 3, 272–279. Johansen, J., Fields, H., & Manning, B. H. (2001). The affective component of pain in rodents: Direct evidence for a contribution of the anterior cingulate cortex. PNAS, 98, 8077–8082. Johnson, M. H. (1995). The inhibition of automatic saccades in early infancy. Developmental Psychobiology, 28(5), 281–291 Jürgens, U. (2002). Neural pathways underlying vocal control. Neuroscience & Biobehavioral Reviews, 26, 235–258 Koski, L., & Paus, T. (2000). Functional connectivity of the anterior cingulate cortex within the human frontal lobe: a brain-mapping meta-analysis. Experimental Brain Research, 133, 55–65. Laland, K. N., & Brown, G. R. (2002). Sense and Nonsense — Evolutionary Perspectives on Human Behaviour. Oxford: Oxford University Press. Lorberbaum, J. P., Newman, J. D., Horwitz, A. R., Dubno, J. R., Lydiard, R. B., Hamner, M. B., Bohning, D. E., & George, M. S. (2002). A potential role for thalamocingulate circuitry in human maternal behavior. Biological Psychiatry, 51, 431–445. MacLean, P. D. (1985). The triune brain in evolution. New York: Plenum Press. Mundy, P. (1995). Joint attention and social-emotional approach behavior in children with autism. Development and Psychopathology, 7, 63–82. Mundy, P. (2001). Joint attention, theory of mind and the medial frontal cortex. Paper presented at the International Meeting for Autism Research (IMFAR), San Diego, CA Mundy, P. & Gomes, A. (1998). Individual differences in joint attention skill development in the second year. Infant Behavior and Development, 21, 469–482. Mundy, P., Card, J. & Fox, N. (2000). EEG correlates of the development of infant joint attention skills. Developmental Psychobiology, 36, 339. Newman, J. D. (2003). Vocal communication and the triune brain. Physiology and Behavior, 79, 495–502. Nimchinsky, E. A., Gilissen, E., Allman, J. M., Perl, D. P., Erwin, J. M., & Hof, P. R. (1999). A neuronal morphologic type unique to humans and great apes. PNAS, 96, 5268–5273. Nimchinsky, E. A., Vogt, B. A., Morrison, J. H., & Hof, P. R. (1995). Spindle neurons of the human anterior cingulate cortex, Journal of Comparative Neurology, 355, 27–37. Paus, T. (2001). Primate anterior cingulate cortex: where motor control, drive and cognition interface. Nature Reviews Neuroscience, 2, 417–424. Pickard, N., & Strick, P. L. (2001). Imaging the premotor areas. Current Opinion in Neurobiology, 11, 663–672. Posner, M. I., & Dehaene, S. (1994). Attentional networks. Trends in Neuroscience, 17, 75–79. Posner, M. I. & Rothbart, M. K. (2000). Developing mechanisms of self-regulation. Development and Psychopatology, 12, 427–441.
Intentional communication and the anterior cingulate cortex
Posner, M. I. and DiGirolamo, G. J. (1998). Executive attention: Conflict, target detection, and cognitive control. In R. Parasuraman (Ed.) The Attentive Brain. Cambridge, MA: The MIT Press. Posner, M. I., & Fan, J. (2002). Attention as an organ system. In J. Pomerantz (Ed.), Neurobiology of perception and communication: From synapses to society. Cambridge UK: Cambridge University Press. Pujol, J., Lopez, A., Deus, J., Cardoner, N., Vallejo, J., Capdevilla, A., & Paus, T. (2002). Anatomical variability of the anterior cingulate gyrus and basic dimensions of human personality. Neuroimage, 15, 847–855. Rapoport, S. (1999). How did the human brain evolve? A proposal based on new evidence from in vivo brain imaging during attention and ideation. Brain Research Bulletin, 50, 149–165. Rizzolatti, G. & Arbib, M. A. (1998). Language within our grasp. Trends in Neurosciences, 21, 188–194. Simonyan, K., & Jürgens, U. (2002). Cortico-cortical projections of the motocortical larynx area in the rhesus monkey. Brain Research, 949, 23–31. Sim-Selley, L. J., Vogt, L. J., Childers, S. R., & Vogt, B. A. (2003). Distribution of ORL‑1 receptor binding and receptor activated G-proteins in rat forebrain and their experimental localization in anterior cingulate cortex. Neuropharmacology, 45, 220–230. Steketee, J. D. (2003). Neurotransmitter system of the medial prefrontal cortex: Potential role in sensitization to psychostimulants. Brain Research Reviews, 41, 203–228. Swick., D., & Javanovic, J. (2002). Anterior cingulate cortex and the Stroop task: Neuropsychological evidence for topographic specificity. Neuropsychologia, 40, 1240–1253. Tomasello, M. & Call, J. (1997). Primate Cognition. Oxford: Oxford University Press. Tucker, D. M. (2001). Embodied meaning: An evolutionary-developmental analysis of adaptive semantics. Institute of Cognitive and Decision Sciences, Technical Report no. 01–04. Ulvund, S., & Smith, L. (1996). The predictive validity of nonverbal communication skills in infants with perinatal hazards. Infant Behavior and Development, 19, 441–449 Van Veen, V., & Carter, C. (2002). The anterior cingulate as a conflict monitor: fMRI and ERP studies. Physiology and Behavior, 6800, 1–6. Whalen, P. J., Bush, G., McNally, R. J., Wilhelm, S., McInerney, S., & Rauch, S. L. (1998). The emotional counting Stroop paradigm: An fMRI probe of the anterior cingulate affective division. Biological Psychiatry, 44, 1219–1228. Yücel, M., Wood, S. J., Fornito, A., Riffkin, J., Velakoulis, D., & Pantelis, C. (2003). Anterior cingulate dysfunction: Implications for psychiatric disorders? Review of Psychiatry and Neuroscience, 28,350–354. Zilles, C. (1990). Cortex. In G. Paxinos (Ed.), The Human Nervous System. San Diego, CA: Academic Press.
About the author

Oana Benga, Ph.D., is Associate Professor and Director of the Program of Cognitive Neuroscience at the Department of Psychology, Babeş-Bolyai University, Cluj-Napoca, Romania. With a background in Psychology and Biology, she pursues research interests that encompass developmental aspects of executive function and social cognition, as well as their neural correlates, in infants and in children with normal or atypical developmental patterns (autism, ADHD, epilepsy, anxiety).
Gestural-vocal deixis and representational skills in early language development

Elena Pizzuto (1), Micaela Capobianco (2) and Antonella Devescovi (2)

(1) Institute of Cognitive Sciences and Technologies, National Research Council, Rome / (2) University of Rome "La Sapienza", Department of Psychology of Developmental Processes and Socialization
This study explores the use of deictic gestures, vocalizations and words, compared to content-loaded, or representational, gestures and words, in children's early one- and two-element utterances. We analyze the spontaneous production of four children, observed longitudinally from 10–12 to 24–25 months of age, focusing on the components of children's utterances (deictic vs. representational), the information encoded, and the temporal relationship between gestures and vocalizations or words that were produced in combination. Results indicate that while the gestural and vocal modalities are meaningfully and temporally integrated from the earliest stages, deictic and representational elements are unevenly distributed in the gestural vs. the vocal modality, and in one- vs. two-element utterances. The findings suggest that while gestural deixis plays a primary role in allowing children to define and articulate their vocal productions, representational skills appear to be markedly more constrained in the gestural as compared to the vocal modality.

Keywords: gestures, words, vocalizations, deictic, representational, early language
1. Introduction

The purpose of this paper is to explore and describe the relationship between gestures, vocalizations and words during two key, universal phases of early language development: the one-word stage, when children begin to produce their first words but for several months appear to be unable to combine them in longer strings, and the transition to two- and multiword speech, when the first combinations of words appear and then become progressively more frequent and articulated in their meanings while, at the same time, vocabulary increases at a fast rate. As shown
by crosslinguistic investigations of a wide variety of languages, this developmental progression can be characterized as a universal feature of language learning (Slobin, 1985; 1992; 1997; Clark, 2003).

Several studies conducted in the last two decades have shown that in these early stages, and prior to the appearance of two-word utterances, children also produce what in broad terms can be defined as utterances consisting of single gestures (these appear before or along with single words), of crossmodal combinations of gestures and words, and even (much more infrequently) of two-gesture combinations (for some of the most relevant papers and reviews see Lock, 1980; Volterra & Erting, 1990; Blake, 2000, and below). These findings are of particular interest when framed within the new view of the multimodal features of spoken language structure that is emerging from the study of gesture in adult language (see for overviews and relevant discussions: Kendon, 1996; McNeill, 1992; 2000; Kita, 2003). McNeill's (1992) work on the apparently idiosyncratic, context-bound forms of coverbal gesturing is especially relevant for our purposes: it shows that in adults (but also in older children, from age 4–5 onward) different types of content-loaded gestures (described primarily as "iconics" and "metaphorics") are used productively, meaningfully and temporally integrated with speech, to articulate meaning in different ways across the vocal and gestural modalities. A question that has been raised, and which we will also explore in this paper, is whether children's early gestures are meaningfully and temporally integrated with speech, as in adults, or rather become an integrated system with development (McNeill, 1992; Butcher & Goldin-Meadow, 2000; Goldin-Meadow and Butcher, 2003; Capirci, Contaldo & Volterra, 2003).

We believe that, in order to appropriately approach this question, it is necessary to ascertain, more accurately than has been done thus far, the types of gestures and words children use in their early utterances, and the information such utterances convey. In particular, we think it is necessary to distinguish deictic from content-loaded, "representational" elements. A distinctive feature of deictic elements, whether they are expressed by words (e.g. demonstratives or locatives such as "this", "there") or by gestures (e.g. the prototypical POINT gesture, produced with the index finger extended to direct someone's attention toward something in the environment), is that their interpretation heavily or entirely depends upon contextual information. The referent of a POINT, much like the referent of a word like "this", cannot be identified without referring to the physical context of utterance (or also to the linguistic context, depending upon the abstraction of the texts produced; see for example McNeill's [1992] observations on abstract pointing in narrative texts, or Lyons' [1977] observations on textual deixis). Although deictic elements are of crucial importance for human
communication and language (see among others Kita, 2003; Lyons, 1977), they cannot and should not be easily assimilated to content-loaded, representational elements such as a word meaning "dog" or "mommy", or more or less conventional gestures that may be used to express such concepts as "TALL" or "GOOD". Regardless of their arbitrariness (in the case of words) or iconicity (in the case of many gestures), these representational elements can only be interpreted by referring to symbolic conventions that must be shared by the word or gesture producer and his/her interlocutor. An appropriate description of the intricate interrelationship between deictic and representational elements in language and communication is beyond the scope of this paper. For the present discussion it is relevant to note that, in adult production, the vast majority of gestures appear to fall in the representational class, while deictic gestures appear in markedly smaller numbers (McNeill, 1992). Furthermore, in adult speakers representational gestures are most frequently used to "add" or "articulate" the information provided in speech. As we will try to demonstrate throughout this paper, children's early gestures and gesture-word utterances seem to exhibit different features.

The present study stems from earlier work described by Capirci, Iverson, Pizzuto and Volterra (1996), Iverson, Capirci and Caselli (1994), Pizzuto (2002), and Pizzuto, Capirci, Caselli, Iverson and Volterra (2000) (for a summary and discussion framed in a broader perspective see Volterra, Caselli, Capirci & Pizzuto, 2005). These studies provided evidence on the gestural and vocal repertoires, and on the relationship between gestural, vocal, and gestural-vocal utterances, in the spontaneous production of twelve Italian children examined at 16 and 20 months, two developmental junctures marking the onset of two-word speech. One of the findings provided by this earlier work is particularly relevant for the present discussion: the indication that there are distinct developmental patterns in children's production of deictic vs. representational elements in the gestural as compared to the vocal modality, and in one- as compared to two-element utterances (Pizzuto, 2002; Volterra et al., 2005).

These studies are important because, due to the relatively large sample of children examined, they allow us to evaluate the generalizability of specific developmental trends more reliably than studies based on few subjects (as is often the case with longitudinal studies, including the present one). They also include valuable information on individual differences as well as group tendencies. However, the evidence available is not sufficient to clarify several potentially important aspects of the phenomena under investigation. We note here some of the major points that, in our view, have not been sufficiently investigated.
First, since the data available are limited to two age points (16 and 20 months), they obviously disregard developmental processes that may be relevant but take place prior to, or following, those age points. Second, the data do not include information on the temporal relationship between gestural and vocal elements in children's crossmodal utterances, a topic of particular interest for ascertaining major similarities and differences between the child and adult gestural-vocal systems. Third, these studies have not explicitly investigated the development of gestures accompanied by vocalizations (rather than words). The relevance of this behavior certainly needs to be assessed in order to achieve a more comprehensive understanding of the relationship between the gestural and vocal channels in the developmental process. Finally, the data that have been published (e.g. Capirci et al., 1996; Volterra et al., 2005) provide only limited information on the use of deictic vs. representational elements in one-element utterances though, as mentioned, interesting asymmetries have been reported (Pizzuto, 2002; Volterra et al., 2005).

The present study aims to provide new longitudinal evidence on these aspects of the developmental process that have not been thoroughly examined in previous work. Towards this end, we examine the gestural, vocal and gestural-vocal productions of four typically developing children, observed longitudinally from 10–12 to 24–25 months. We focus on the components of children's utterances (deictic vs. representational), the information they convey, and the temporal relationship between gestures and vocalizations or words that were produced in combination.

1.1 Methodological issues

It is necessary to discuss some of the major methodological issues concerning the criteria used for classifying children's gestural and gestural-vocal productions, and for comparing these with vocal productions. As noted by different authors (see among others Erting and Volterra, 1990; Capirci et al., 1996; Mayberry & Nicoladis, 2000; Guidetti, 2002; Volterra et al., 2005), the classificatory criteria and terminology adopted in much of the developmental literature on gesture are not homogeneous, and often differ from those used in research on adult gesture (Mayberry and Nicoladis, 2000). Coupled with differences in theoretical frameworks, focus, methodology and data sets, this heterogeneity often makes it very difficult to assess the comparability and generalizability of the findings that are reported. Within the limits of this paper we cannot pursue an appropriate discussion of the different proposals that have been formulated. We will only point out some of the most relevant similarities and differences between our terminology and classificatory criteria and those used in other developmental studies that are partially comparable with ours.
In this paper we draw a distinction between two broad classes of gestures that have been observed in children's production: (1) deictic gestures, including the prototypical POINT but also other gestures (see 2.2 below); (2) a heterogeneous variety of "content-loaded" gestures, made with hand or body movements or facial expressions, that come to be associated with identifiable, relatively stable meanings across different contexts of production (e.g. flapping the hands for "BIRD", opening and closing the mouth for "FISH", conventional gestures such as headshakes for "NO" and headnods for "YES", or culturally specific gestures such as the Italian "GOOD", made with a closed fist, the tip of the index finger touching the cheek with a short rotating movement).

The gestures in this second, heterogeneous class have been labeled with a variety of different terms, including "referential" or "symbolic" (Caselli, 1990; Caselli & Volterra, 1990), "characterizing" or "iconic" (Goldin-Meadow & Morford, 1985; 1990; Goldin-Meadow & Butcher, 2003), and "signs" (Goodwyn & Acredolo, 1993). Throughout this paper we use the term "representational" (after Iverson et al., 1994; Capirci et al., 1996), but the reader should be aware that this term is at least partially controversial (see 2.2 for additional remarks on this issue).

Differences among authors are not limited to terminology but also regard, more importantly, the criteria used for assigning a given gesture to one or the other of these two broad categories, and for interpreting the meaning of gestures in context. For example, in a longitudinal study of three children observed from 10–12 to 28–30 months, Goldin-Meadow and Morford (1990: 252), using Bloom's (1970) method of rich interpretation for children's early words and Fillmore's (1968) case grammar, describe a flat hand, extended palm-up gesture used to request objects as a characterizing gesture, which is labeled GIVE and considered a predicate. In the analytic and coding model adopted for the present study (see below and Section 2.2), this gesture would be classified as a deictic gesture, and labeled REQUEST.

As mentioned above, several studies have shown that, prior to producing two-word utterances, children use meaningful gesture-word combinations. The production of these crossmodal utterances appears to be a crucial developmental step during the transition from one- to two-word speech (Goldin-Meadow & Morford, 1985; 1990; Morford & Goldin-Meadow, 1992; Butcher & Goldin-Meadow, 1993; 2000; Capirci et al., 1996; Pizzuto et al., 2000; Goldin-Meadow & Butcher, 2003; Volterra et al., 2005). All of these studies have also highlighted (with lesser or greater emphasis) that the gestures children most frequently combine with words are deictics. However, relevant differences in the way gesture-word combinations have been analyzed (and compared to two-word utterances) need to be noted. These differences are primarily linked to the "richer" or "more conservative" interpretation assigned to deictic gestures and, consequently, also to the gesture-word combinations that include deictic gestures.
For example, Goldin-Meadow and Morford (1985; 1990) were the first to propose an interesting distinction between two different types of gesture-word combinations children produced during the transition from one- to two-word speech: a complementary and a supplementary type. The complementary type was exemplified by such utterances as 〈POINT at glasses while saying "glasses"〉. These combinations were described as referring to the same semantic element, with the deictic gesture merely (and redundantly) reinforcing the meaning of the co-occurring word. The supplementary type was exemplified by utterances such as 〈POINT at glasses while saying "out"〉 to request that glasses be taken out of the case. In these utterances, the gesture and the word were considered to refer to two distinct semantic elements: the POINT was considered semantically and functionally equivalent to the noun "glasses", representing the patient case, while the word "out" symbolized the act predicate (Goldin-Meadow and Morford, 1990, p. 253; see also Morford & Goldin-Meadow, 1992; Butcher & Goldin-Meadow, 1993).

In more recent work, the same two types of gesture-word combinations have been classified by Butcher and Goldin-Meadow (2000) and Goldin-Meadow and Butcher (2003) in different terms, distinguishing between "combinations that convey the same referent" (e.g., POINT at a dog accompanied by the word "dog") and "combinations in which the gesture conveys a referent that differs from that conveyed in speech" (e.g., POINT at a pair of glasses while saying "mommy"). Regardless of the terms with which it has been labeled, this distinction appears to be very important from a developmental perspective: Butcher and Goldin-Meadow (2000) and Goldin-Meadow and Butcher (2003) have provided data showing that the combinations with "different information" (formerly supplementary), but not those with "same information" (formerly complementary), are significantly related to, and predict, the onset of two-word utterances.

In agreement with Capirci et al. (1996), we have found that the distinction between complementary and supplementary utterances originally proposed by Goldin-Meadow and Morford (1985; 1990) is indeed useful for characterizing different types of gesture-word combinations produced by children, provided it is purged of unwarranted assumptions in interpreting the meaning of deictic gestures, modified as needed, and extended to characterize the informational content of two-word and two-gesture utterances, in order to allow appropriate comparisons.

Capirci et al. (1996) proposed a more conservative analysis of the information conveyed by deictic gestures, within a more general analytic model. This model is adopted in the present study and is described in detail in Section 2.2 below. For the purposes of the present discussion, it is important to note that in this model deictic gestures are never assimilated to the nouns and/or verbs that occur in speech (at most they are considered comparable to deictic words). This
methodological decision has important consequences for the coding of gesture-word combinations, especially with respect to the distinction between complementary vs. supplementary utterances.

Perusal of the examples provided by Goldin-Meadow and her colleagues as illustrations of gesture-word combinations conveying different information (i.e., supplementary) shows that many such examples could be classified in different terms if one were to adopt Capirci et al.'s (1996) model. For example, Goldin-Meadow and Butcher (2003) include in the "different information" class the combination 〈POINT at cow while saying "moo"〉. It is plausible to hypothesize that this classification is based on the assumption that the POINT in this utterance can be likened to a noun for the "cow" referent, while the co-occurring (onomatopoeic) word is taken as a predicate-symbol for the noise produced by the animal referred to. In Capirci et al.'s (1996) model, combinations of this sort are classified (more conservatively) as complementary utterances in which the pointing gesture singles out the intended referent and the co-occurring word provides a name for it.

Complementary utterances are also assigned a very different value in the framework proposed by Capirci et al. (1996) as compared to the framework proposed by Goldin-Meadow and her colleagues. Take for example combinations such as 〈POINT at bubbles while saying "bubbles"〉, classified as complementary by Goldin-Meadow and Morford (1990, p. 258), or 〈POINT at dog while saying "dog"〉, classified as "same information" by Goldin-Meadow and Butcher (2003). Combinations of this sort are classifiable as complementary utterances within Capirci et al.'s model, but with a crucial difference concerning the information value assigned to the deictic gesture: the "pointing" in these utterances does not play a "redundant" role, because it is not equivalent to the nouns "bubbles" or "dog". On the contrary, the pointing gesture plays a truly complementary role, disambiguating the referents intended by the child. In Capirci et al.'s framework, utterances of this sort are comparable to vocal-only utterances such as "that bubbles" or "that dog".

Although a direct inspection of the original data would certainly be required to assess the validity or generalizability of the different analytic and coding methodologies, all of these observations indicate that the analysis of gesture-word combinations produced by children is not an easy task, and that different interpretations can be assigned (and, consequently, different results obtained) depending upon the implicit or explicit assumptions that are made about the communicative function or information value of the behaviors under examination. An appropriate examination of the relationship between gestures and words in children's early language crucially requires uniform, explicit criteria for classifying and comparing the communicative, symbolic and/or linguistic status of children's productions, irrespective of the modality in which they are articulated (Erting &
Volterra, 1990; Volterra et al., 2005). Although many problems still remain to be solved, in the present study, as in Capirci et al. (1996), we tried to take these methodological concerns into due account by using, as much as possible, comparable classificatory criteria for children's gestural, gestural-vocal, and vocal productions.
2. Method

2.1 Participants and data collection procedure

Four children (three males, one female; two first-born, two second-born) participated in the study. All children came from upper-middle-class Roman families and, to the extent that it could be ascertained prior to the study, all could be classified as typically developing children. The children were observed at home, during videotaped sessions lasting on average 45 minutes, while they spontaneously interacted with their mothers (occasionally also with their fathers or other caregivers) in three different contexts: play with new examples of familiar objects (a set of toys provided by the experimenter), play with familiar objects, and during a meal or snack time. Observation sessions were scheduled monthly but, as often happens in longitudinal studies of this sort, this schedule could not always be followed and some sessions were skipped. The data collected include from 9 to 15 monthly records for each child, covering the developmental period between 10–12 and 24–25 months.

2.2 Coding and analysis

All communicative gestures, vocalizations and words were transcribed from the videotapes. The communicative status of children's gestures and words was assessed following the criteria proposed by Thal and Tobias (1992), Iverson et al. (1994) and Capirci et al. (1996): gestures and words were considered communicative if they were accompanied by eye contact with the child's interlocutor, by vocalizations, and/or by other clear evidence of an effort to direct or maintain the interlocutor's attention.

The coding scheme adopted for the present study, illustrated in Table 1 with examples from the children's production, is essentially the one proposed by Capirci et al. (1996), with some additions and modifications concerning the categories of deictic vocalizations and bimodal utterances that were found to be useful in subsequent work (Corsi, 1998; Pizzuto et al., 2000; Pizzuto, 2002).

Table 1. Coding scheme and notational conventions, with illustrative examples of the children's gestural and vocal productions (a)

(a) Types of gestures, words and vocalizations in children's utterances
Deictic
  GESTURES (DG): POINT, SHOW, REQUEST
  words (dw): qua, là, questo, quello, eccolo, io, tu, mio, tuo 〈here, there, this, that, here_it_is, I, you, mine, yours〉
  vocalizations (dv): eh, de, tete, hm, ah
Representational
  GESTURES (RG): FISH, WORM, HORSE, GOOD, TALL, SLEEP, NO, YES, ALL_GONE, BYE_BYE
  words (rw): mamma, pappa, piccolo, guarda, no, sì, più, ciao 〈mommy, food, small, look, no, yes, all_gone, bye_bye〉

(b) Two-element utterances: modality, information conveyed, components
b.1. Bimodal and equivalent (G=v/w)
  b.1.1 DG=dv: SHOW (keys) = de
  b.1.2 RG=rw: NO = no 〈no〉; BE-SILENT = zitto 〈be_silent/quiet〉
b.2. Crossmodal (G-w): complementary (&) and supplementary (+)
  b.2.1 DG&rw: POINT (ant) & rommica 〈formica: ant〉
  b.2.2 DG&dw: POINT (keys) & quette 〈queste: these〉
  b.2.3 DG+rw: POINT (at toy) + bello 〈beautiful〉
  b.2.4 DG+dw: REQUEST (bubbles) + qua 〈here〉
  b.2.5 RG+dw: tete 〈questi: these〉 + ALL_GONE
  b.2.6 RG+rw: cattia 〈cattiva: naughty〉 + SCOLD
b.3. Vocal (w-w): complementary (&) and supplementary (+)
  b.3.1 dw&rw: eccolo & Titti 〈here_it_is & Tweety (bird)〉
  b.3.2 rw+rw: tante + bolle 〈many + bubbles〉
  b.3.3 rw+rw: gadda + pipa 〈guarda: look + pipe〉
  b.3.4 dw+rw: eccio + mamma 〈questo: this + mommy〉
  b.3.5 dw+dw: io + quello 〈I + that〉
b.4. Gestural (G-G): complementary (&) and supplementary (+)
  b.4.1 DG&RG: POINT (to picture of worm) & WORM
  b.4.2 DG+DG: POINT (to person) + POINT (to book)

(a) Adapted from Capirci et al. (1996). The different categories of gestures, words, vocalizations and two-element combinations of gestures and/or words or vocalizations are described in the text. Abbreviations used throughout the text for each major type of gestures and words are given in parentheses (e.g. DEICTIC GESTURES = DG). Gestures are represented by English labels, in CAPITALS. Vocalizations and words are given in lower-case letters, in an orthographic rendition of the form produced by the children, followed by the adult target word (when appropriate) and its English translation, in angle brackets. An underline character indicates single gestures or words which require more than one English word to be labeled (e.g. ALL_GONE for the "palms up" gesture, "eccoli" 〈here_they_are〉).

We classified as one-word or one-gesture utterances all words or gestures that occurred in
isolation. Gestures and vocalizations or words that appeared to be meaningfully related were classified as two- or multi-element utterances (see below).

As shown in Table 1, all gestures and words occurring in children's utterances were distinguished into two major categories: deictic and representational. Deictic gestures (DG) all fulfilled the basic function of "drawing the interlocutor's attention towards something in the environment", and included three types: POINT (index finger or, in a few cases, full hand extended towards an object, location or person), SHOW (holding up an object in the adult's line of sight), and REQUEST (extending the arm toward an object, location or person, sometimes with a repeated opening and closing of the hand). Deictic words (dw) included demonstrative and locative expressions and personal and possessive pronouns, as defined in linguistic terms (Lyons, 1977). Deictic vocalizations (dv) comprised idiosyncratic vocal productions, highly variable in form, that frequently accompanied DG and appeared to function as vocal "generic directors of attention", i.e. as deictic signals.

Representational gestures (RG) included some iconically motivated gestures (e.g. an opening and closing of the mouth for FISH; a repeated bending of the index finger at the middle joint, with an outward movement of the whole hand, for WORM; a back and forth rocking of the whole body, mimicking a horse-riding motion, for HORSE; raising the hand and arm up high for TALL; joining the hands, palms inward, and reclining the head on them while closing the eyes, for SLEEP). Other RG were Italian conventional gestures such as the one for GOOD previously described, or the one for "CIAO" (HI/BYE_BYE: opening and closing the four fingers, thumb extended). Still others were conventional (most likely universal) gestures observed in children across cultures: shaking the head sideways for NO, nodding for YES, turning and raising the palms up for ALL_GONE. Representational words (rw) included content words that, in the adult language, are assigned to the classes of common and proper nouns, adjectives, verbs and adverbs (e.g. mommy, food, small, look). Affirmative and negative expressions (e.g. "yes", "no", "all_gone"), interjections (e.g. "bravo!"), greetings (e.g. "ciao" = bye_bye), adverbials and prepositions such as "up" and "down" (fairly rare in the children's production), and the onomatopoeic or sound-symbolic forms that are frequent in early speech (e.g. "brum-brum" for "car") were also included in the class of rw.

As remarked in Capirci et al. (1996), it is clear that some of the items classified as representational (whether gestural or vocal) may have an uncertain symbolic status even in adult communication, and more detailed classifications may be needed for capturing relevant aspects of the children's developing semantic system (see for example Iverson et al., 1994). It has also been rightfully observed that the category of representational gestures, as defined in Capirci et al. (1996) and here, is too heterogeneous to be considered as a whole and that, in particular, conventional
gestures (e.g., YES, NO, ALL_GONE) should be clearly distinguished, and examined separately, from iconic, child-specific gestures (Guidetti, 2002). The broad distinction we make between deictic and content-loaded, i.e. more representational, elements is motivated primarily by the need to use categories that can be applied as consistently as possible to the analysis of both gestures and words. This distinction is thus intended to capture some, but by no means all, of the most relevant aspects of the evolving relationship between gestural and vocal productions in children's early language. Finer distinctions within both the RG and the rw classes will certainly need to be drawn in future work.

Children's two- and multi-element utterances were distinguished into different types according to the modality in which they were produced (gestural and/or vocal), the information conveyed, and the component elements of which they were made (i.e., DG, RG, dv, dw, rw). The distinctions we made are best illustrated with reference to two-element utterances, as shown in Table 1 (b).

Two-element utterances were coded as "bimodal and equivalent" (see Table 1, b.1) when they consisted of a gestural and a vocal element that conveyed essentially the same information. Two subtypes were distinguished. One, RG=rw, already described by Capirci et al. (1996), consisted of two representational elements, as in the combination of the RG BE-SILENT (closed fist, index finger extended and brought up vertically to touch the lips) with the rw meaning "be-silent". As noted by Capirci et al. (1996: 669), these utterances can be characterized as "bimodal one-element utterances", rather than true combinations of two elements. This characterization can be extended to the second subtype of bimodal equivalent utterances we distinguished in the present study: those consisting of a deictic gesture and a deictic vocalization (DG=dv), as in the SHOW gesture combined with an emphatic vocalization "de!" to direct the adult's attention towards a toy the child was holding.

The productions we coded as DG=dv have been described in many early studies of deictic gestures in early development, where it is frequently observed that pointing is often accompanied by vocalizations (e.g., Bates, 1976); indeed, the presence of vocalizations is one of the criteria used for assessing the communicative status of the gestures children produce, as mentioned above. More recently, Butcher and Goldin-Meadow (2000) have noted "meaningless vocalizations" occurring with gestures during the transition from single- to multiword speech. In our analysis we found that, although devoid of any well-defined meaning, most vocalizations co-occurring with manual DG played a basic attention-directing, deictic function that in principle was very similar (i.e. equivalent) to that fulfilled by the co-occurring manual gesture. These vocalizations in a sense enhanced (via vocal means) the deictic operation performed by the child, making her request
of attention more salient than it would have been if it were performed "silently". We believe that charting the developmental course of such vocalizations that accompany deictic gestures is relevant for a clearer understanding of the relationship between gestural and vocal means of expression in early language development.

The remaining utterance types described in Table 1 all consisted of crossmodal (i.e. gestural and vocal), vocal-only, or gestural-only combinations of two elements. Following Capirci et al. (1996), regardless of the gestural and/or vocal elements implicated, two major classes of utterances were distinguished according to the information they conveyed: complementary and supplementary.

Complementary utterances always referred to a single referent. The distinctive feature of such utterances, marked by the "&" character placed between the two combined elements, was the presence of one deictic element (gestural or vocal) which provided non-redundant, or "complementary", information for singling out or disambiguating the intended referent. For example, in the crossmodal utterance under b.2.1 in Table 1, the DG POINT (to a real ant), accompanied by the rw for "ant", was a relevant element for singling out the intended referent "ant". A similar function was played by the dw "here_it_is" accompanied by the rw for "Tweety" in the vocal utterance referring to a picture of Tweety bird under b.3.1, or by the DG POINT that accompanied the RG for WORM in the gestural utterance under b.4.1 (referring to an illustrated worm). A different, yet still complementary, relationship was observed when the two combined units were both deictic, as in the crossmodal combination of a DG and a dw in b.2.2 in Table 1, where the DG provided the necessary information for identifying, in context, the intended referent "keys" of the otherwise ambiguous dw "these".

Supplementary utterances referred either (more often) to a single referent, as in examples b.2.3–b.2.6 and b.3.2 in Table 1, or to two referents (see examples b.3.3–b.3.5 and b.4.2 in Table 1). In all cases, however, each of the combined elements added information to the other one; hence the "+" character used for notating this type of utterance. The prototypical case of supplementary utterances were vocal combinations of two representational elements (e.g. rw+rw; see examples b.3.2–b.3.3 in Table 1), but other crossmodal or gestural combinations of representational and/or deictic elements occurred (e.g., POINT at toy while saying "beautiful", as in example b.2.3, or POINT to the interlocutor and then to a book, as in example b.4.2, to suggest that the interlocutor look at a specific book).

In broad terms, regardless of modality differences, most (albeit not all) complementary utterances can be likened to forms of naming, while most (albeit again not all) forms of supplementary utterances can be assimilated to forms of predication (Pizzuto, 2002; Volterra et al., 2005). As noted by Capirci et al. (1996), it is clear that, beyond their similarities, the various subtypes of supplementary and
complementary utterances differ at both the symbolic and the semantic level. For example, the degree of "symbolic definition" of a vocal utterance such as b.3.3 in Table 1 〈"look (at the) pipe"〉 markedly differs from that of a gestural utterance such as b.4.2, in which the relationship between the intended referents is merely suggested, but by no means fully specified, by the sequence of two DG (roughly interpretable as "you this"). Our analysis aims to capture the major similarities between crossmodal (or gestural-only) and vocal utterances.

The analysis of bimodal and crossmodal utterances also included an examination of the temporal relationship between the gestural and vocal elements occurring in such utterances. Perceptual criteria were used to distinguish between synchronous and asynchronous combinations of gestures and vocalizations or words. Combinations were defined as synchronous when, upon a frame-by-frame analysis of the videotapes, we could not perceive any time interval between the gesture and the vocalization or word with which it was combined. Conversely, combinations were defined as asynchronous when we could perceive a time interval between the combined elements. The time interval between the elements of asynchronous utterances was then computed in hundredths of seconds.

2.3 Reliability

All videotapes were transcribed and coded independently by two trained coders. After independent coding, a first measure of intercoder reliability was computed by counting the number of disagreements over the total number of communicative utterances identified by the two coders. Agreement ranged between 85% and 100%, depending upon the coding category. A second, independent measure of intercoder reliability was assessed on a subsample of the data (approximately two hours of videotapes), which was reviewed by a third coder. Mean agreement with the previous coders was 93% across coding categories.
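Schematically, the first agreement measure can be restated as a percentage (this formula is a restatement of the counting procedure just described, not an additional measure; d is the number of disagreements and N the total number of communicative utterances identified by the two coders):

\[
\text{agreement} = \left(1 - \frac{d}{N}\right) \times 100\%
\]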
3. Results and discussion

All children produced a rich variety of one-element and two- or more-element vocal, bimodal and crossmodal utterances, with relevant individual differences that were, on the whole, comparable to those reported in other studies on early gestural-vocal development (see among others Iverson et al., 1994; Capirci et al., 1996; Butcher & Goldin-Meadow, 2000; Guidetti, 2002; Volterra et al., 2005).
In the present work we focus on one- vs. two-element utterances. We describe the developmental patterns observed by reporting the number of tokens (rather than relative percentages) of the different types of utterances produced. This choice is motivated by the consideration that children's early productions are often limited to few utterance tokens. Relative percentages computed over few tokens may provide a somewhat distorted view of the patterns under analysis, and it may be more difficult to assess their significance.

Figure 1 summarizes the major developmental trends in the children's production of single gestures (1 G), single words (1 w), combinations of a single gesture with either a vocalization or a word (G-v/w), and two words (w-w).

[Figure 1 appears here: one panel per child (GAL, MAR, ALE, GIO); x-axis: age in months; y-axis: N of tokens.]

Figure 1. Number of utterances consisting of single gestures (1 G), single words (1 w), combinations of gestures with vocalizations or words (G-v/w), and two words (w-w) in each child's production at different ages

Individual differences are immediately apparent. Two children (GAL and MAR) produced a markedly larger number of utterance tokens, especially of 1 w, compared to the other two (ALE and GIO). One child, GIO, exhibited a delayed developmental pattern: unlike the other three children, who began producing w-w utterances at the canonical age of 17–18 months, this child remained at the one-word stage throughout the observation period, and produced only one w-w token, at 24 months.

Despite and beyond individual differences, common developmental trends can be identified. All children produced 1 G and G-v/w utterances prior to (GAL
and GIO), or at the same time as (MAR and ALE) they began producing, 1 w utterances. All children produced G-v/w utterances well before (from 4 to 8 months before) w-w utterances. Thus, across children, gestural-vocal utterances were the first type of two-element utterances produced, in agreement with the findings of previous studies. Two-gesture combinations (G-G) occurred in such small numbers (from 0 to 4 tokens in the children's total production) that they were ignored for descriptive purposes. All children produced both 1 G and 1 w utterances. However, in the three children with a more regular developmental pattern, 1 w utterances became much more numerous than 1 G utterances either at the one-word stage (in MAR) or following the onset of two-word utterances (in GAL and ALE).

3.1 One-element gestural and vocal utterances

Figure 2 illustrates the distribution of deictic vs. representational elements in children's one-element utterances.

[Figure 2 appears here: one panel per child (GAL, MAR, ALE, GIO); x-axis: age in months; y-axis: N of tokens.]

Figure 2. Number of one-element utterances consisting of single representational words (1 rw) or gestures (1 RG) vs. single deictic words (1 dw) or gestures (1 DG) in each child's production at different ages

The data in Figure 2 show that, in the vocal modality, the earliest and most productively used utterances consisted of representational words (1 rw), and these increased across sessions. Single deictic word utterances (1 dw) appeared later in
all children (from 16 to 18 months), and were much less frequent. In the gestural modality, a less uniform pattern was noted: 1 DG and 1 RG utterances were both used by all children, and appeared at the same time (in three children) or within a one-month interval (in MAR, 1 RG utterances appeared after 1 DG utterances). There were no marked differences in the frequency with which 1 DG and 1 RG utterances were used by individual children.

Examining the children's production patterns across observation sessions, we found that, at a global level, within the class of utterances with representational elements, the vocal modality prevailed over the gestural one: the proportion of 1 rw ranged from 72% to 99% (depending upon the individual child), compared to 1% to 34% for 1 RG utterances. The same analysis applied to one-element deictic utterances showed that these were preferentially expressed gesturally (though with a wider range of variation): 1 DG utterances constituted from 52% to 98% of all one-element deictic utterances, while the proportion of 1 dw utterances varied from 2% to 48%. The more productive use of 1 rw utterances is of course not surprising in the light of all that is known about typically developing children. However, the productive and persisting use of gestural deixis in one-element utterances, along with the later development of vocal deixis, points out the relevance of gestures for conveying this fundamental function of human communication.

3.2 Two-element utterances

Figure 3 describes the development of the major types of two-element utterances identified in the four children's production. In this figure, bimodal equivalent utterances (G=v/w) are distinguished from true crossmodal combinations (G-w; see Table 1).

[Figure 3 appears here: one panel per child (GAL, MAR, ALE, GIO); x-axis: age in months; y-axis: N of tokens.]

Figure 3. Number of two-element utterances of different types in each child's production at different ages: bimodal equivalent (G=v/w) vs. crossmodal gesture-word (G-w) and vocal two-word (w-w) utterances

It can be seen that, beyond individual differences, in all children the earliest type of two-element utterance to appear is the bimodal equivalent one. These are very frequent, especially in GAL and GIO, but decrease (more or less rapidly, depending upon the child) with development. True crossmodal combinations of a gesture and a word (G-w) follow a different developmental pattern. In two children (MAR and ALE), G-w combinations appear very early (12 months), albeit in very small numbers (one or two tokens), along with bimodal utterances. In the other two children (GAL and GIO), these combinations appear later, at 14 months, after bimodal utterances. In all children, G-w combinations precede w-w utterances, appear to increase in several observation sessions both prior to and following the appearance of w-w utterances, and remain the most productive type of two-element utterance through all but three observation sessions. The only exception to
this pattern were GAL's and MAR's sessions at 24 months, when w-w utterances outnumbered G-w utterances, and ALE's last observation, at 25 months, when he produced only w-w, and no G-w, combinations.

Concerning the category of bimodal equivalent utterances, it must be noted that, across children and observation sessions, the vast majority of these utterances consisted of DG=dv (from 76% to 97% of all bimodal equivalent utterances). Equivalent RG=rw utterances (see Table 1) occurred in much smaller proportions (from 3% to 24%). This finding on the relatively marginal use of redundant, equivalent combinations of two representational elements is in agreement with what was found by Capirci et al. (1996). The data on DG=dv utterances provide new information on the strong links between gestural and vocal deixis (albeit realized via vocalizations, not deictic words). These links also emerged from considering the proportion of DG=dv utterances compared to that of 1 DG utterances in the children's total production. We found that the proportion of DG=dv ranged from 50% to 62%, compared to 38% to 50% for 1 DG utterances. Thus, at a global level, half or more of the children's total production of "generically deictic" gestures were accompanied by deictic vocalizations.
The key role of DG in children's development, however, is most evident if we consider the more sophisticated use children make of DG in their crossmodal utterances, and how these compare to combinations of words which, in our model, express comparable information. Figures 4 and 5 below illustrate the development of the different subtypes of G-w utterances, distinguished into the complementary and supplementary classes, compared to their functionally equivalent w-w utterances (see Table 1). Both figures illustrate only the combination subtypes that occurred in appreciable numbers. Combinations that were produced in only very few tokens (e.g. crossmodal RG+rw, RG+dw, dw+dw; see Table 1 and below) were included in the total number of combinations illustrated in Figure 3, but not in the graphs of Figures 4 and 5.

The data in these figures show that virtually all the G-w utterances produced by the children consisted of a DG with a rw or a dw. These data clearly suggest that DG, not RG, are the key gestural element children use for articulating the informational content of their crossmodal utterances. It is, in our view, of particular interest that crossmodal combinations of two representational elements (RG+rw) were virtually absent from the children's production, and so were RG+dw combinations (total N of tokens across children: 10 and 2, respectively). It is significant that these combination subtypes were absent in the production not only of the child who used very few 1 RG utterances (MAR; see Figure 2), but also of the two children who produced a fairly large number of 1 RG utterances, such as GAL and ALE. In contrast, it is evident from Figure 5 that vocal combinations of two representational elements (rw+rw) were produced in appreciable numbers, from age 17–18 months onward, by all children but one (GIO). These longitudinal data on the extremely sparse use of RG in two-element, supplementary utterances, and on the prevailing use of rw+rw combinations, confirm and extend previous findings described by Capirci et al. (1996), Pizzuto et al. (2000) and Pizzuto (2002).

Focusing on the complementary utterances in Figure 4, it can be seen that, with some negligible exceptions, the ones that were more frequent in all children were DG&rw combinations. This is the crossmodal utterance type we likened earlier to a form of nomination. It can also be seen that while DG&rw utterances were used frequently, and appeared early (from 12–14 months on) in all children, their corresponding vocal counterpart (dw&rw) appeared considerably later (from 20 to 24 months), in three children only, and in very small numbers. The second subtype of crossmodal complementary utterances, DG&dw, the one that is perhaps more common in adult usage, also appeared later in development (from age 15 months onward) and was much less frequent than the DG&rw type.
Figure 4. Number of complementary utterances of different types in each child's production at different ages: crossmodal (DG&rw, DG&dw) vs. vocal (dw&rw).
Figure 5. Number of supplementary utterances of different types in each child's production at different ages: crossmodal (DG+rw) vs. vocal (rw+rw, dw+rw).
Turning now to the supplementary utterances illustrated in Figure 5, a somewhat different developmental pattern can be noted. First, these utterance types were used with some (greater or lesser) regularity by only three children: GAL, MAR and ALE. The more delayed child, GIO, produced only three such utterances (not included in Figure 5) at 24 months: two were crossmodal (DG+rw, DG+dw), one vocal (rw+rw). Of the three children who used them, one (MAR) produced the crossmodal and vocal types at the same age (18 months), continued to use both types with approximately the same frequency for a period of four to six months (depending upon the subtype; see Figure 5), then began producing a markedly larger number of vocal-only supplementary utterances (especially rw+rw) in the last observation sessions. In the other two children (GAL and ALE), crossmodal DG+rw utterances appeared one or two months earlier than their vocal counterparts (rw+rw and dw+rw). In these children, as in MAR, dw+rw utterances were less frequent than rw+rw utterances, and both began to increase by the last sessions. Comparing Figure 5 with Figure 4, it can be seen that in GAL's, MAR's and ALE's production DG+rw utterances appeared from one (GAL) to four months (MAR, ALE) after the complementary utterances DG&rw and DG&dw. This pattern also applied to the fourth child (GIO) who, as noted, produced complementary crossmodal utterances, but only two crossmodal supplementary utterances, at the age of 24 months. Insofar as later appearance is an indication of greater complexity (or greater difficulty for the developing child), these data suggest that supplementary crossmodal utterances, like their vocal-only counterparts, may require (and/or reflect) greater cognitive-symbolic abilities than complementary crossmodal utterances. Comparing the developmental pattern of DG+rw vs. dw+rw utterances in Figure 5, it can be noted that the former preceded and/or were used more frequently than the latter. This provides an additional indication that gestural deixis precedes, and then accompanies, vocal deixis. Finally, comparing Figure 4 and Figure 5, it can be seen that, across children and observations, within the complementary class crossmodal utterances are markedly more frequent than their vocal counterparts. In contrast, within the supplementary class, crossmodal and vocal utterances are produced for several months with roughly the same frequency, and vocal-only utterances, especially rw+rw, prevail by the last observation sessions. The analysis of the overall frequency of crossmodal and vocal utterances in the children's total production, with respect to the complementary or supplementary classes, also revealed uneven patterns. Within crossmodal utterances, the complementary type was markedly more frequent (from 53 to 167 tokens, depending upon the child) than the supplementary type (from 3 to 45 tokens, depending
upon the child). The reverse pattern held for vocal utterances: the supplementary type was much more frequent (from 64 to 141 tokens) than the complementary type (from 4 to 16 total tokens across children). We also found that, within the complementary class, crossmodal utterances (DG&rw and DG&dw grouped together) were in markedly higher proportions (from 91% to 100%, depending upon the child) than vocal-only utterances (dw&rw: from 0% to 9%). This pattern was reversed for supplementary utterances: these were for the most part vocal (from 67% to 82%, grouping together rw+rw and dw+rw), while crossmodal utterances were represented in much smaller proportions (from 18% to 33%). These data suggest that children's crossmodal productions were on the whole somewhat biased towards conveying complementary information, while their vocal productions were biased towards conveying supplementary information. Earlier in this paper we noted that crossmodal supplementary utterances can be likened to forms of predication that are on the whole comparable to their vocal counterparts. The data discussed above suggest that, although crossmodal predication structures may appear earlier than vocal ones, they are clearly more constrained in the forms they take (as shown by the almost complete absence of the RG+rw subtype). They also appear to be markedly less frequent than their corresponding vocal utterances. Taken together, the data on both complementary and supplementary utterances indicate that, in these early stages of language development, gestures, and specifically DG, play a key, very productive role in articulating crossmodal utterances that can be likened to nomination, while vocal elements, most notably rw, appear to be the privileged means for realizing predication.

3.3 Multi-element utterances: Crossmodal vs. vocal

The three children who used two-element crossmodal and vocal supplementary utterances also produced, from 19 to 21 months onward, a sizeable number of multi-element utterances. The total number of tokens of such utterances ranged from 18 (in GAL) to 22 (in ALE) to 74 (in MAR). A detailed description of these utterances is beyond the scope of this paper, but we wish to note that in all three children the majority of these utterances were crossmodal productions that included a gesture, most frequently a DG, as one of their constituent elements. Computing the relative proportions of crossmodal vs. vocal-only multi-element utterances, we found that crossmodal productions were represented in proportions of 55%, 77% and 79% in, respectively, ALE's, GAL's and MAR's global production, while vocal-only productions constituted the remaining 45%, 23% and 21%. These data confirm earlier findings by Capirci et al. (1996: 664), and suggest that there is strong developmental continuity in the use of gestures (notably:
deictics) in children's multi-element utterances. Two utterances from MAR's production at 24 months exemplify this pattern. Referring to a toy car whose headlights resembled two eyes, MAR first produced the utterance "locchi 〈occhi: (the) eyes〉 & POINT (at the car's eyes/headlights)". Then, shortly after, he produced the three-element utterance: "gadda 〈guarda: look〉 + locchi 〈(the) eyes〉 & POINT (at the car's eyes/headlights)".

3.4 The temporal relationship between gestures, vocalizations and words

One of the aims of the present study was to ascertain whether the gestures children combined with vocalizations and/or words were realized as synchronous gestural-vocal combinations, or were rather sequenced in time, resulting in asynchronous combinations. Figure 6 shows the development of synchronous (Syn) compared to asynchronous (As) combinations across all types of two-element utterances the children produced, collapsed into the broad category of G-v/w (i.e. including both bimodal and crossmodal utterances). The data show that in three children (GAL, MAR and ALE) synchronous combinations were more frequent than asynchronous ones (especially in GAL's and MAR's production).
Figure 6. Number of synchronous (Syn) vs. asynchronous (As) combinations of gestures with vocalizations or words in each child's production at different ages.
There were some exceptions to this pattern: GAL's earliest combinations, at 10 months (N = 5), were all asynchronous, and in GAL's as in ALE's production there were some sessions in which synchronous and asynchronous combinations were produced in roughly the same number (e.g. GAL's 11-month session, or ALE's sessions at 16 and 20 months). It must also be recalled that the "0" value for both synchronous and asynchronous combinations in ALE's session at 25 months reflects the fact that in this session the child did not produce any G-v/w combination. In the child who exhibited an overall developmental delay (GIO), a more irregular pattern was found: initially, asynchronous combinations prevailed over synchronous ones (at 10–11 months); then, from 12 to 19 months, the relative frequency of synchronous vs. asynchronous combinations oscillated between higher and lower values, and only in the last observation sessions (23–24 months) did synchronous combinations clearly prevail over asynchronous ones. Within asynchronous combinations, the time interval that elapsed between gestures and vocalizations or words ranged (across children and observation sessions) from a minimum of 21 to a maximum of 74 hundredths of a second, and a tendency for this time interval to decrease with development was noted. Figure 7 provides a more detailed picture of the development of synchronous vs. asynchronous combinations within the more restricted set of crossmodal (complementary and supplementary) utterances.

Figure 7. Number of synchronous (Syn) vs. asynchronous (As) crossmodal complementary and supplementary (&, +) utterances in each child's production at different ages.

Looking at Figure 7, we can observe that in two children, MAR and ALE, the earliest crossmodal combinations produced, from 12 to 13–15 months, were all synchronous. On the whole, across children, the vast majority of crossmodal utterances were synchronous. In two children, GAL and MAR, synchronous combinations were more frequent either from the start (MAR) or following an initial period in which synchronous and asynchronous combinations appeared in roughly the same number (GAL's 14–16-month sessions). In the other two children, ALE and GIO, synchronous and asynchronous combinations appeared in roughly the same number for several months, and synchronous combinations became markedly more frequent in the last two to four observation sessions, along with an overall increase in the number of combinations produced, a pattern that is partly detectable in GAL's production from 17 to 23 months. These data indicate that, over and beyond individual differences, both combinations of gestures with vocalizations (Figure 6) and combinations of gestures with words (Figure 7) tended to be synchronous, and this tendency was more marked in sessions in which a larger number of combinations were produced. These findings are in agreement with those provided by a recent longitudinal study of three children (age range: 10–23 months) reported by Capirci et al.
(2003), and differ from those reported by Butcher and Goldin-Meadow (2000) and Goldin-Meadow and Butcher (2003). These authors found that, in five of the six children they observed, there was an initial, well-identifiable developmental period in which asynchronous combinations of gestures and words prevailed over synchronous ones (although synchronous combinations were on the whole more frequent than asynchronous ones). In the present work, as in Capirci et al.'s (2003) study, this pattern was not found, and in two of the four children we observed synchronous combinations preceded asynchronous ones. However, precise comparisons between these different studies cannot be made, due to differences in focus and analytic methodologies. More conclusive evidence certainly requires further research, and the role of individual differences needs to be ascertained.
4. Conclusion

The longitudinal data we have described and discussed show that the gestural and vocal modalities appear to be meaningfully and temporally integrated from the
earliest stages of language development, supporting the views expressed by McNeill (1992: 295 ff.) on the strong linkage existing from the start between gesture and speech, a linkage that is also evident in the combinations of deictic gestures and vocalizations. However, and in agreement with the findings reported (on different data sets) by Capirci et al. (1996), Pizzuto et al. (2000), Pizzuto (2002) and Capirci et al. (2003), the gestural and the vocal modality do not appear to contribute in the same manner to the articulation of meaning in children's early one- and two-element utterances, and the integration of gesture and speech observed through the first two years of life presents features that differ from those noted in adult speech. Our longitudinal data show that representational and deictic elements are not distributed in a comparable manner in the gestural and in the vocal modality, nor in one- as compared to two-element utterances. In the gestural modality, there is a clear prevalence of deictic over representational elements. The gestures children employ most frequently in meaningful combinations with words are DG, not RG, a pattern which is not observed in adult communication (and which is obscured if deictic elements are not clearly distinguished from representational ones). Insofar as the crossmodal complementary and supplementary combinations of DG and rw we have described can be likened to their vocal-only counterparts, it is clear that these meaningful crossmodal combinations make a substantial contribution in allowing children to "name" (more frequently) and "predicate" (less frequently). These basic functions of human language appear to develop first (and, in the case of "naming", also more productively) across modalities, and only later within the vocal modality alone. In agreement with earlier results, these findings underscore the relevance of gestural deixis as a primary (albeit still ancillary) device in early language development, well into the two-word stage. In the vocal modality, there is a clear prevalence of representational over deictic elements. This is most evident in the patterns of production of one-element utterances (though with relevant individual differences among the children we observed) and, more significantly, in two-element utterances encoding supplementary information, where children most frequently combine two representational words, rather than two gestures, or a representational word and a representational gesture. Taken together, these findings indicate that while the multimodal features of language are evident from the start, with a tight integration of deictic gestures with vocalizations and/or speech, in these early stages of language development children's representational abilities in the gestural modality are considerably more constrained than those of adults or older children. Further research is needed to chart,
in different contexts of production, the appearance and developmental course of more sophisticated uses of representational gestures in children’s language.
Acknowledgements

We gratefully acknowledge partial financial support from the European Science Foundation EUROCORES Programme OMLL, funds from the Italian National Research Council, and the EC Sixth Framework Programme under Contract no. ERAS-CT-2003-980409. We also wish to thank Silvia Baldi, Ausilia Elia, Oriana La Veneziana and Elisa Nesti for their substantial help with data collection, transcription and coding, and one anonymous reviewer for helpful comments and criticisms.
Note

1. Throughout this paper, labels for individual gestures are given in CAPITALS.
References

Bates, E. (1976). Language and context: The acquisition of pragmatics. New York: Academic Press.
Blake, J. (2000). Routes to child language: Evolutionary and developmental precursors. Cambridge: Cambridge University Press.
Bloom, L. (1970). Language development: Form and function in emerging grammars. Cambridge, MA: MIT Press.
Butcher, C. & Goldin-Meadow, S. (2000). Gesture and the transition from one- to two-word speech: When hand and mouth come together. In D. McNeill (Ed.), Language and gesture. Cambridge: Cambridge University Press, pp. 235–257.
Capirci, O., Contaldo, A. & Volterra, V. (2003). Il legame gesto-parola nelle prime fasi di sviluppo linguistico: una prospettiva longitudinale [The gesture-word link in the early phases of linguistic development: a longitudinal perspective]. Poster presented at the XVII National Congress AIP, Developmental Psychology Section, Bari, Italy, September 22–24, 2003.
Capirci, O., Iverson, J. M., Pizzuto, E. & Volterra, V. (1996). Gestures and words during the transition to two-word speech. Journal of Child Language, 23, 645–673.
Caselli, M. C. (1990). Communicative gestures and first words. In V. Volterra & C. J. Erting (Eds.), From gesture to language in hearing and deaf children. New York: Springer-Verlag, pp. 56–67. (2nd edition 1994, Washington, DC: Gallaudet University Press.)
Caselli, M. C. & Volterra, V. (1990). From communication to language in hearing and deaf children. In V. Volterra & C. J. Erting (Eds.), From gesture to language in hearing and deaf children. New York: Springer-Verlag, pp. 263–277. (2nd edition 1994, Washington, DC: Gallaudet University Press.)
Clark, E. V. (2003). First language acquisition. Cambridge: Cambridge University Press.
Corsi, C. (1998). Lo sviluppo della deissi gestuale e vocale nell'acquisizione del linguaggio [The development of gestural and vocal deixis in language acquisition]. Undergraduate dissertation, University of Rome "La Sapienza", Department of Psychology of Developmental Processes and Socialization.
Erting, C. & Volterra, V. (1990). Conclusion. In V. Volterra & C. J. Erting (Eds.), From gesture to language in hearing and deaf children. Berlin: Springer-Verlag, pp. 299–303. (2nd edition 1994, Washington, DC: Gallaudet University Press.)
Fillmore, C. J. (1968). The case for case. In E. Bach & R. T. Harms (Eds.), Universals in linguistic theory. New York: Holt, Rinehart & Winston.
Goldin-Meadow, S. & Butcher, C. (2003). Pointing toward two-word speech. In S. Kita (Ed.), Pointing: Where language, culture and cognition meet. Mahwah, NJ: Lawrence Erlbaum, pp. 85–107.
Goldin-Meadow, S. & Morford, M. (1985). Gesture in early child language: Studies of hearing and deaf children. Merrill-Palmer Quarterly, 31, 145–176.
Goldin-Meadow, S. & Morford, M. (1990). Gesture in early child language. In V. Volterra & C. J. Erting (Eds.), From gesture to language in hearing and deaf children. New York: Springer-Verlag, pp. 249–262. (2nd edition 1994, Washington, DC: Gallaudet University Press.)
Goodwyn, S. & Acredolo, L. (1993). Symbolic gesture versus word: Is there a modality advantage for the onset of symbol use? Child Development, 64, 688–701.
Guidetti, M. (2002). The emergence of pragmatics: Forms and functions of conventional gestures in young French children. First Language, 22, 265–285.
Iverson, J. M., Capirci, O. & Caselli, M. C. (1994). From communication to language in two modalities. Cognitive Development, 9, 23–43.
Kendon, A. (1996). An agenda for gesture studies. Semiotic Review of Books, 7(3), 8–12.
Kita, S. (Ed.) (2003). Pointing: Where language, culture and cognition meet. Mahwah, NJ: Lawrence Erlbaum.
Lock, A. J. (1980). The guided reinvention of language. London: Academic Press.
Lyons, J. (1977). Semantics (vols. 1 & 2). Cambridge: Cambridge University Press.
McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago: University of Chicago Press.
McNeill, D. (Ed.) (2000). Language and gesture. Cambridge: Cambridge University Press.
Morford, M. & Goldin-Meadow, S. (1992). Comprehension and production of gesture in combination with speech in one-word speakers. Journal of Child Language, 19, 559–580.
Pizzuto, E. (2002). Communicative gestures and linguistic signs in the first two years of life. Paper presented at the EURESCO Conference "Brain Development and Cognition in Human Infants", Maratea, Italy, June 7–12, 2002.
Pizzuto, E., Capirci, O., Caselli, M. C., Iverson, J. M. & Volterra, V. (2000). Children's transition to two-word speech: Content, structure and functions of gestural and vocal productions. Paper presented at the 7th International Pragmatics Conference, Budapest, July 9–14, 2000.
Slobin, D. I. (Ed.) (1985). The crosslinguistic study of language acquisition. Vol. 1: The data; Vol. 2: Theoretical issues. Mahwah, NJ: Lawrence Erlbaum.
Slobin, D. I. (Ed.) (1992). The crosslinguistic study of language acquisition. Vol. 3. Hillsdale, NJ: Lawrence Erlbaum.
Slobin, D. I. (Ed.) (1997). The crosslinguistic study of language acquisition. Vol. 4. Mahwah, NJ: Lawrence Erlbaum.
Thal, D. & Tobias, S. (1992). Communicative gestures in children with delayed onset of oral expressive vocabulary. Journal of Speech and Hearing Research, 35, 1281–1289.
Volterra, V. & Erting, C. J. (Eds.) (1990). From gesture to language in hearing and deaf children. Berlin: Springer-Verlag. (2nd edition 1994, Washington, DC: Gallaudet University Press.)
Volterra, V., Caselli, M. C., Capirci, O. & Pizzuto, E. (2005). Gesture and the emergence and development of language. In M. Tomasello & D. I. Slobin (Eds.), Beyond nature-nurture: Essays in honor of Elizabeth Bates. Mahwah, NJ: Lawrence Erlbaum, pp. 3–40.
About the authors

Elena Pizzuto, a researcher at the Italian National Research Council (CNR), currently coordinates the Sign Language Laboratory at the CNR Institute of Cognitive Sciences and Technologies. Her research focuses on the linguistic investigation of Italian Sign Language (LIS) in a crosslinguistic, crosscultural perspective, and on language development in hearing and deaf children.

Micaela Capobianco is currently a post-doctoral fellow at the Università di Roma "Sapienza", Department of Psychology of Developmental Processes and Socialization. Her research focuses on the role of gestures in early language learning in typically developing children and in atypical conditions (pre-term children), and on the use of different language assessment methodologies in clinical practice.

Antonella Devescovi is professor of developmental psychology at the University of Rome "La Sapienza", Department of Psychology of Developmental Processes and Socialization. Her research interests are primarily in comparative crosslinguistic investigations of spoken language processing, in children and adults, in typical and atypical conditions.
Building a talking baby robot
A contribution to the study of speech acquisition and evolution

J. Serkhane (1), J. L. Schwartz (1), P. Bessière (2)
(1) ICP, Grenoble / (2) Laplace-SHARP, Gravir, Grenoble
Speech is a perceptuo-motor system. A natural computational modeling framework is provided by cognitive robotics, or more precisely speech robotics, which is also based on embodiment, multimodality, development, and interaction. This paper describes the bases of a virtual baby robot, which consists of an articulatory model that integrates the non-uniform growth of the vocal tract, a set of sensors, and a learning model. The articulatory model delivers a sagittal contour, a lip shape and acoustic formants from seven input parameters that characterize the configurations of the jaw, the tongue, the lips and the larynx. To simulate the growth of the vocal tract from birth to adulthood, a process modifies the longitudinal dimension of the vocal tract shape as a function of age. The auditory system of the robot comprises a "phasic" system for event detection over time, and a "tonic" system to track formants. The model of visual perception specifies the basic lip characteristics: height, width, area and protrusion. The orosensorial channel, which provides tactile sensation on the lips, the tongue and the palate, is elaborated as a model for the prediction of tongue-palate contacts from articulatory commands. Learning involves Bayesian programming, which comprises two phases: (i) specification of the variables, decomposition of the joint distribution and identification of the free parameters through exploration of a learning set, and (ii) utilization, which relies on questions asked about the joint distribution. Two studies were performed with this system. Each of them focused on one of the two basic mechanisms that ought to be at work in the initial periods of speech acquisition, namely vocal exploration and vocal imitation. The first study attempted to assess infants' motor skills before and at the beginning of canonical babbling. It used the model to infer the acoustic regions, the articulatory degrees of freedom and the vocal tract shapes that are most likely explored by actual infants, according to their vocalizations. Subsequently, the aim was to simulate data reported in the literature on early vocal imitation, in order to test whether
and how the robot was able to reproduce them, and to gain some insight into the actual cognitive representations that might be involved in this behavior. Speech modeling in a robotics framework should contribute to a computational approach to sensori-motor interactions in speech communication, which seems crucial for future progress in the study of speech and language ontogeny and phylogeny.

Keywords: speech robotics, speech development, sensori-motor exploration, Bayesian robotics, vocal imitation
1. Introduction

1.1 Linking perception and action in speech robotics

Speech perception and production are often studied independently of one another. However, speech is obviously a sensori-motor system. This is the starting point of the so-called "Perception-for-Action-Control" Theory (PACT) (Schwartz et al., 2002), in which we argue that perception is the set of tools, processes and representations that enable the control of action. PACT proposes that, since the perceptual and the motor representations are acquired together during speech development, they constrain each other in adulthood, although they belong to different domains. The main idea is that, in order to study the perceptual and motor representations that underlie speech in adults and that shape the world's languages, a relevant strategy is to focus on how they develop in concert with each other during speech acquisition. In this approach, a natural computational modeling framework is provided by cognitive robotics, promoted among others by R. Brooks through the Cog project, which focuses on the notions of "[…] embodiment and physical coupling, multimodal integration, developmental organization, and social interaction" (Brooks et al., 1999). Embodiment, multimodality, development and interaction are also the core of "Speech Robotics" (Abry & Badin, 1996; Laboissière, 1992), a research program in which we try to:

1. elaborate a sensori-motor virtual "robot" able to articulate and perceive speech gestures (embodiment: Boë et al., 1995e; Schwartz & Boë, 2000) and to learn multisensorial-motor links (multimodality: Schwartz et al., 1998) in parallel with the growth of its vocal apparatus;
2. determine the exploration strategies by which this robot could evolve from vocalizing and babbling to the control of complex speech gestures;
3. explore how communication principles in a society composed of such agents could shape the acoustic and articulatory structures of human languages (interaction: Berrah et al., 1996).

The present project concerns a preliminary stage of this research program. It aims at laying the bases for modeling speech development, that is, at implementing the virtual baby robot, a growing sensori-motor system able to learn and to interact (Schwartz et al., 2002).

1.2 A viewpoint on speech development

The viewpoint supported here is that the development of orofacial control in speech relies on two fundamental behaviors: the progressive exploration of the vocal tract's sensori-motor abilities, and the imitation (overt simulation) of caretakers' language sounds. That is to say, articulatory exploration should be the way by which infants discover the abilities of their vocal tracts and learn the relationships between movements and percepts. At the same time, imitation ought to capitalize on the knowledge acquired by exploration to tune, step by step, the control of the articulatory system so as to produce the gestures and sounds of the target languages. The first attempts to simulate speech development in robotics were based on the assumption that infants explore their entire space of articulatory-acoustic realizations and then select their native language items out of all the possible ones (Bailly, 1997; Guenther, 1995). In other words, infants were supposed to start by uttering all the possible speech sounds of all languages (in agreement with Jakobson, 1968). However, direct observation shows that infants do not do so (Kent & Miolo, 1995): whatever their ambient language, they only produce a certain subset of what can be performed with their phylogenetically inherited sensori-motor apparatus. Moreover, on a computational level, exhaustive exploration complicates the learning of sensori-motor links (Bessière, 2000). Infants thus do not explore the whole articulatory-acoustic space in order to master their vocal tract behaviors. Further, the sensori-motor developmental facts likely to be linked with speech development can be classified according to whether they are roughly a matter of exploration or of imitation.

(a) At birth, infants are able to imitate three gestures from vision: tongue and lip protrusion, and mandible depression (Meltzoff, 2000). Although these movements, employed in adult speech, are not obviously linked with speech development, they are nonetheless available before the first vocalizations.
(b) At a few weeks old, infants vocalize. Moreover, they tend to direct their productions towards vowel sounds they often perceive (early vocal imitation: Kuhl & Meltzoff, 1996), and to match a vowel sound to the moving image of the face that utters it (multimodal integration: Kuhl & Meltzoff, 1992).
(c) At about seven months old, infants become babblers: their mandibles move upwards and downwards in a rhythmic way, while their vocal folds vibrate. This is what has been referred to as Canonical Babbling (Koopmans-Van Beinum & Van Der Stelt, 1986; MacNeilage & Davis, 1990).
(d) Later on, children begin to control, more or less successively, the number of jaw cycles, the movements of the articulators carried by each cycle independently of one another, and finally the full shape of their "vocal resonator" (motor coordination). This enables them to master the sounds and sequential patterns of their ambient languages (Vilain et al., 2000).

Section 2 describes the sensory, articulatory and learning models the virtual robot is made of. At first, the aim was to specify its early motor skills: articulatory exploration was assessed from the acoustic description of vocalizations produced by actual infants, both before (phase (b)) and at the beginning (phase (c)) of canonical babbling (Section 3; see also Serkhane et al., 2002). As for the imitation issue, a model of imitation was proposed and used to simulate an experiment on actual infants. The influence of the parameters that tune the robot's first imitation abilities was studied, yielding some information about the sensori-motor representations likely to underlie this behavior in infancy (Section 4; see also Serkhane et al., 2003). Section 5 gives some plans for the future of this project in relation to ontogeny and phylogeny.
2. The vocalizing baby robot On the production level, the Variable Linear Articulatory Model (VLAM, Boë, 1999) provides the robot with a virtual vocal tract that integrates the non-uniform growth of human tract. As for perception, the auditory, the visual and the tactile modalities are available with a model per each modality. The relationships between the tract movements and their perceived consequences are learned (during exploration) and used (in imitation) within a Bayesian robotics formalism. 2.1 The articulatory model The Variable Linear Articulatory Model (VLAM) is a version of the Speech Maps Interactive Plant (SMIP, Boë et al., 1995a) that integrates a model of the vocal tract growth. The core of the SMIP is Maeda’s model (Maeda, 1989) or a variant proposed by Gabioud (1994). Its elaboration consisted of a thorough statistical analysis of 519 hand-drawn midsagittal contours corresponding to a 50 frames/ sec. radiographic film synchronized with a labiographic film that contained 10 sentences in French, recorded at the Strasbourg Institute of Phonetics (Bothorel et al., 1986). The midsagittal contours were analyzed with a semi-polar grid, and a
Building a talking baby robot
guided principal component analysis found that seven parameters explained 88% of the variance in the observed tongue contours for the selected (adult) speaker. A linear combination of the seven parameters enables the regeneration of a midsagittal contour of the vocal tract. The weighting values of each parameter were normalized, using the standard deviation around the mean position of the observed values as reference. The lip shape was modeled from measurements analyzed at ICP (Abry & Boë, 1986; Guiard-Marigny, 1992). Hence, the articulatory model delivers a sagittal contour and a lip shape from the seven input parameters (hereafter Pi, i = 1..7), which may be interpreted in terms of phonetic commands and correspond roughly to the jaw (J), the tongue body (TB), dorsum (TD) and tip (TT), the lip protrusion (LP) and height (LH), and the larynx height (Lx) (Figure 1). The area function of the vocal tract is estimated from the midsagittal dimensions with a set of coefficients derived from tomographic studies. The formants and the transfer function are calculated from the area function, and a sound can be generated from the formant frequencies and bandwidths.
Figure 1. The articulatory model.
From this basis, it was possible to implement a growth model that makes it possible to replace the adult "robot" with a "baby" one. Systematic measurements of the vocal tract from birth to adulthood do not exist at present. However, it was possible to take advantage of the cranio-facial measures established at different ages by Goldstein (1980). These data were closely fitted by (double) sigmoidal curves, which characterize general skeletal and muscular growth. To account for vocal tract growth, the VLAM, developed by Maeda (cf. Boë & Maeda, 1998), describes the evolution of the horizontal and vertical dimensions from a newborn to a female or a male adult. As proposed by Goldstein, the growth process was introduced by modifying the longitudinal dimension of the vocal tract according to two scaling factors: one for the anterior part of the vocal tract and the other for the pharynx, interpolating the zone in between. The non-uniform growth of the vocal tract can thus be simulated year by year and month by month. Similarly, typical F0 values were adjusted to follow the growth data presented by Mackenzie Beck (1997). A more detailed presentation of the model, together with an assessment of its agreement with both morphological and acoustical data on infants and children, can be found in Ménard et al. (2002, 2004).
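As an illustration of this growth principle, the sketch below (Python/NumPy) rescales a tract described by its section lengths, with one factor for the anterior cavity, one for the pharynx, and linear interpolation in between. The function name, the number of sections and the scale factors are hypothetical choices for this example, not the actual VLAM implementation.

```python
import numpy as np

def grow_tract(section_lengths, k_front, k_pharynx, n_front, n_pharynx):
    """Rescale the longitudinal dimension of a vocal tract described by
    its section lengths, ordered from lips to glottis: the first n_front
    sections (anterior cavity) are scaled by k_front, the last n_pharynx
    (pharynx) by k_pharynx, and the factor is linearly interpolated
    over the zone in between."""
    lengths = np.asarray(section_lengths, dtype=float)
    n = len(lengths)
    scale = np.empty(n)
    scale[:n_front] = k_front
    scale[n - n_pharynx:] = k_pharynx
    n_mid = n - n_front - n_pharynx
    scale[n_front:n - n_pharynx] = np.linspace(k_front, k_pharynx, n_mid)
    return lengths * scale

# Illustrative values only: shrink an adult tract towards an infant one,
# with the pharynx shortened proportionally more than the oral cavity.
adult = np.full(30, 0.55)                       # 30 sections, ~16.5 cm total
infant = grow_tract(adult, 0.65, 0.45, 10, 10)
print(round(infant.sum(), 2))                   # total infant tract length (cm)
```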
2.2 The sensory models

2.2.1 Auditory model

The tracking of speech gestures must involve a way to capture and characterize the basic components of the speaker's vocal actions, namely timing and targets (Schwartz et al., 1992). A series of influential works carried out at the Pavlov Institute of Leningrad in the 1970s led Chistovich to propose a basic architecture for the auditory processing of speech sounds. It consists of one system specialized in temporal processing and the detection of acoustic events, and another continuously delivering various analyses of the spectral content of the input (Chistovich, 1976, 1980). The neurophysiological bases for this processing are already available in primary neurons in the auditory nerve, or in secondary neurons in the cochlear nucleus (the first auditory processing center in the central nervous system). This provides the basis of the auditory system of the robot (Figure 2).

Figure 2. The auditory model: an event-detection ("When?") pathway delivering the temporal distribution of acoustic events, and a spectral ("What?") pathway delivering their formant characteristics (in Bark), yielding pairs {ti, Fj(ti)}.

The system specialized in event detection is based on so-called "phasic" units in the central nervous system, namely "on" and "off" units responding only to quick increases and decreases of the neural excitation in a given spectral region. We developed a physiologically plausible module for the detection of articulatory-acoustic events in the cochlear nucleus, such as voicing onset/offset, bursts, and vocalic onset/offset (Piquemal et al., 1996; Wu et al., 1996). These events, which allow the labeling of every major discontinuity in the speech signal, are crucial for the control of timing in speech production (Abry et al., 1985, 1990). The system specialized in spectral processing needs so-called "tonic" units responding continuously to a given stimulus, thereby enabling precise statistics and computations about the variations of excitation depending on their characteristic frequency. Though the debate on the role of formants in the auditory processing of speech is far from closed (e.g. Bladon, 1982; Pols, 1975), it seems that basic neurophysiological ingredients are available for formant detection in the auditory nerve,
through spatio-temporal statistics (Delgutte, 1984); and higher in the auditory chain, as early as in the cochlear nucleus, through lateral inhibition mechanisms for contrast reinforcement. Hence, formants are the basic spectral parameters characterizing speech sounds in our system.

2.2.2 Visual model

In the multisensorial framework, the robot needs eyes as much as ears. Indeed, it is quite well known that speech is not only heard but also seen (e.g. Dodd & Campbell, 1987; Campbell et al., 1998). Speechreading makes it possible to partly follow speech gestures when audition is lacking, particularly in hearing impairment; it improves speech intelligibility in noisy audio conditions or with foreign languages; it intervenes in gesture recovery even when the visual input conflicts with the audio one, as in the famous McGurk effect (McGurk & MacDonald, 1976); and the visual input is involved in the development of speech control, and in the acquisition of phonology in conjunction with cued speech for hearing-impaired people (see Schwartz et al., 2002, for a review of audio-visual fusion in the context of a theory of perception for action control). The visual sensor should be able to capture what can be seen on the speaker's face, that is, lip geometry, jaw position, and probably some parts of the tongue. At present, the visual inputs of the robot are the basic lip characteristics: height, width, area and protrusion.
2.2.3 Tactile model

The orosensorial channel, which carries tactile sensation on the lips, the tongue and the palate, is most often absent from models of the perceptual representation of speech gestures. However, humans possess a highly developed representation of the oral space. This is illustrated by data on oral stereognosis, in which subjects are able to integrate tactile and motor information to identify three-dimensional objects placed in their mouths (Bosma, 1967). The tip of the tongue and the lips belong to the most sensitive parts of the body surface, as shown by two-point discrimination data. The neurophysiology of the tactile orosensory system has been described in a number of reviews (see e.g. Hardcastle, 1976; Landgren & Olsson, 1982; Kent et al., 1990). Most of the oral mucosa, and particularly the tongue, is supplied with mechanoreceptors of different types, able to provide detailed information about the position of the jaw, lips and tongue, and about the velocity and direction of movement. Histological data show that the density of sensory endings decreases progressively from the front to the rear of the mouth: the tip of the tongue seems the best endowed with receptors in the oral system, in agreement with its fine tactile acuity. Several findings show the importance of the tactile sensor for speech control. MacNeilage et al. (1967) cited the case of a patient with a generalized congenital deficit in somesthesic perception: she produced totally unintelligible speech although she had no apparent auditory or motor trouble. Hoole (1987) and Lindblom et al. (1977) showed the influence of oral sensitivity on the production of perturbed speech (bite-block experiments). The above facts motivated the elaboration of a model for the prediction of palatal contacts of the tongue from articulatory commands (Schwartz & Boë, 2000). In this model, patterns of palatal contacts are described by five variables (hereafter Li, i = 1..5) defining the number of contacts per line along five lines that go from
the periphery to the middle of the palate (Recasens, 1991) (Figure 3).

Figure 3. The palatal tactile sensor of the baby robot, with its five contact lines L1 to L5. See text.

The Li values are predicted from the articulatory commands Pi by a linear-with-threshold associator: Li = f(Σj wij Pj + wi0), where the wij and wi0 are the weights and biases to be learned, and f is a threshold function limiting the Li to their ranges of variation, that is, from 0 (no palatal contact on the corresponding line) to their maximal possible values (respectively 9, 8, 7, 5 and 4). The values of wij and wi0 were tuned by minimizing a summed square error between observed and predicted Li values (Figure 4a).
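A minimal sketch of such a linear-with-threshold associator is given below (Python/NumPy). The weights here are random placeholders, since the fitted values are not reproduced in the text; only the clipping bounds come from the model description.

```python
import numpy as np

L_MAX = np.array([9.0, 8.0, 7.0, 5.0, 4.0])   # max contacts for lines L1..L5

def predict_contacts(P, W, b):
    """Linear-with-threshold associator: Li = f(sum_j wij * Pj + wi0),
    with f clipping each Li to [0, L_MAX[i]].
    P: 7 articulatory commands; W: 5x7 weight matrix; b: 5 biases."""
    return np.clip(W @ P + b, 0.0, L_MAX)

# Placeholder weights; in the model they are fitted by minimizing the
# summed square error between observed and predicted contact patterns.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(5, 7))
b = rng.normal(scale=0.5, size=5)
print(predict_contacts(np.zeros(7), W, b))    # contacts for the neutral tract
```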
Figure 4a. Predicted (in black) and observed (in gray) palatal configurations for prototypical [i], [a], [o] (from top to bottom).

Figure 4b. Predicted palatal configurations (left) for a thousand articulatory configurations around [i] (formants on the right; the same was done for [a] and [u]).
To test the behavior of this model, predicted palatal contacts were computed for a large number (about 1,000) of articulatory configurations that lead to formant frequencies in the acoustic regions of the vowels [i], [a], [u]. Though these configurations vary widely in their articulatory parameters, the predicted palatal contacts turned out to be quite coherent (Figure 4b), and in line with the variability of contacts observed by Recasens (1991) for vowels embedded in various consonantal contexts. Hence, the model seems able to provide useful predictions, adequately linked to the articulatory and acoustic structure of the gesture.

2.2.4 Simplified perceptual models

In order to focus on learning problems, we chose to take into account a restricted and simplified set of sensory variables, easily interpretable in phonetic terms. The auditory variables were the first two formant frequencies (F1, F2) expressed in Bark, that is, on a scale of frequency perception (Schroeder et al., 1979): z(Bark) = 7 arcsinh(F(Hz)/650). The simplified tactile system relied on the vocal tract geometry, which can be described by the following systems (Boë et al., 1995b): (i) the area (Ac) and the distance from the glottis (Xc) of the main constriction along the vocal tract, as well as the inter-lip area (Al) when produced by the robot's vocal tract (Fant, 1960); (ii) the coordinates (Xh, Yh) of the tongue's highest point in a fixed system of reference (Boë et al., 1992). The visual system estimates Al when it comes from a peer's face. This set of variables is displayed in Figure 5.

Figure 5. The simplified sensory models: formants (F1, F2, F3) in Bark, inter-lip area Al, tongue highest point (Xh, Yh), constriction place and area (Xc, Ac), and palatal contact lines L1 to L5.
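The Bark conversion above is a one-line computation; the following snippet is a direct transcription of the formula (the example frequencies are merely illustrative).

```python
import math

def hz_to_bark(f_hz):
    """Frequency warping used for F1 and F2: z = 7 * asinh(F / 650)."""
    return 7.0 * math.asinh(f_hz / 650.0)

# e.g. a low vowel with F1 = 700 Hz and F2 = 1300 Hz
print(hz_to_bark(700.0), hz_to_bark(1300.0))  # ~6.5 and ~10.1 Bark
```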
2.3 The model of sensori-motor learning

Learning here involves two steps, which may be synchronized in time but are studied separately at a first stage. Firstly, the robot learns basic relationships between motor commands and sensory inputs, through an endogenous exploration process (driven only by internal motivation). Secondly, the robot attempts to reproduce a given sound presented by speaking partners, given the knowledge acquired by exploration. In the future, this exogenous stage (driven by external stimuli provided by the environment) will also contribute to learning, so as to focus the robot's inventory of actions and percepts on the patterns of its ambient language. The Bayesian Robot Programming (BRP) environment developed for general robot programming by Lebeltel et al. (2004) is used to implement the learning and imitation behaviors. The theoretical foundations of BRP come from the analysis of the central difficulty faced by a robot system, namely: how can it use an incomplete model of its environment to perceive, infer, decide and act efficiently? To address this problem, BRP proceeds in two steps:
– The first step (learning) transforms the irreducible incompleteness into uncertainty. Given some preliminary knowledge (supplied by the designer) and some experimental data (acquired by the robot), learning builds a description of the phenomenon, which takes the mathematical form of a probability distribution. The maximum entropy principle is the theoretical foundation of this first step: given some preliminary knowledge and some data, the probability distribution that maximizes the entropy is the one that best represents this couple. Entropy gives a precise, mathematical and quantifiable meaning to the "quality" of a distribution (for justifications of the maximum entropy principle see, for instance, Bessière et al., 1998a, 1998b). Preliminary knowledge, even imperfect and incomplete, is relevant and provides interesting hints about the observed phenomenon. The resulting descriptions give no certainties, but they provide a means to take the best decision given the available information.
– The second step (reasoning) is in charge of making inferences with the probability distributions obtained in the first step. The BRP formalism is very general and encompasses, for instance, the following particular cases: Bayesian nets (Pearl, 1988), Hidden Graphical Models (Lauritzen & Spiegelhalter, 1988; Lauritzen, 1996; Jordan, 1998; Frey, 1998), Markov Localization (Thrun, 1998)
and Partially Observable Markov Decision Processes (Kaelbling, Littman & Cassandra, 1996). BRP uses a strict and systematic methodology to model a phenomenon. It always proceeds as follows:

A — Learning:
A1. Specification: define the preliminary knowledge
A1.1 — Choose the variables relevant to the behavior to model
A1.2 — Decompose the joint distribution of the set of relevant variables into a product of simple distributions
A1.3 — Define the parametric forms of the simple distributions
A2. Identification: identify the free parameters of the simple distributions
B — Reasoning:
Utilization: ask a question about the joint distribution

During specification (A1), the variables that define the problem to be modeled are chosen. In the case of the present work, these variables were the articulatory and perceptual parameters dealt with earlier. In order to constrain the problem, the decomposition of their joint distribution takes into account the relationships between the different variables, given physical and phonetic pieces of knowledge. More precisely, the purpose was to build in their assumed (in)dependencies with respect to each other, be they conditional or not. The distributions affiliated with each variable (i.e. their parametric forms) were chosen to be Gaussian or uniform laws. Within this robotic framework, exploration takes place during identification, which consists of providing the robot with experimental data so that its simple distributions can actually be implemented. For example, if the decomposition of the joint probability contains a Gaussian law, the associated free parameters, that is, the mean and the variance, are worked out from the set of experimental data, and this simple distribution is then considered learned by the robot. At the end of learning, a description of the sensori-motor system is obtained. The robot can use it to solve problems such as inversion. Imitation requires inversion, whose associated question is: "which articulatory configuration could lead to the target percept?". Section 4 will describe in more detail how this framework was used in the case of the baby robot. However, in order to be as realistic as possible, the robot had to be specified in connection with actual data. This is the purpose of the next section.
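To make the learn-then-ask cycle concrete, here is a deliberately minimal sketch, not the actual BRP environment: the joint distribution over one motor variable M and one sensory variable S is specified as a single Gaussian, identified from exploration data, and then questioned for inversion. The toy motor-to-percept mapping merely stands in for the VLAM synthesizer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Endogenous exploration: sample motor commands, observe toy percepts.
M = rng.uniform(-3.0, 3.0, 1000)               # motor variable
S = np.tanh(M) + 0.05 * rng.normal(size=1000)  # stand-in for the synthesizer

# A1/A2 - specification and identification: the joint P(M, S) is assumed
# to be one Gaussian, whose free parameters are estimated from the data.
mu = np.array([M.mean(), S.mean()])
cov = np.cov(np.vstack([M, S]))

# B - utilization: ask "which M could lead to the target percept s*?"
# For a joint Gaussian, P(M | S = s*) is itself Gaussian, with
# closed-form conditional mean and variance.
def invert(s_star):
    mean = mu[0] + cov[0, 1] / cov[1, 1] * (s_star - mu[1])
    var = cov[0, 0] - cov[0, 1] ** 2 / cov[1, 1]
    return mean, var

print(invert(0.5))
```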
3. Simulating vocal exploration before and at the beginning of babbling

As infants do not start by exploring all possible speech sounds, we first tried to assess the articulatory abilities available both before and at the beginning of canonical babbling, that is, at 4 and 7 months. To obtain this information from the first two formant frequencies of vocalizations produced by real subjects at these developmental stages, three specially designed analysis techniques were developed. They were termed acoustic framing, articulatory framing and geometric framing. Their description and results will be given after the data they processed are presented.

3.1 Phonetic data

We worked on two sets of data from studies in developmental phonetics. The first one is composed of vowel-like sounds produced by 4-month-old subjects during the early vocal imitation tests of Kuhl and Meltzoff (1996) (see Section 4.1 for further details). Matyear and Davis supplied us with the second set of data, collected in order to study syllable-like productions in Canonical Babbling (Matyear, 1997; Matyear et al., 1998). We selected their 7-month-old subjects' vowel-like sounds, at Canonical Babbling onset. These two data sets have the advantage of having been carefully acquired, labeled and analyzed, in a series of paradigms and protocols described in great detail in the original publications. In each case, the first two formant values and a phonetic description were available for analysis.

3.2 Acoustic framing

3.2.1 Method

All the sounds generated by the VLAM belong to the Maximal Vowel Space (MVS) (Boë et al., 1989). The MVS corresponds to what an infant at a given age would be able to produce if s/he used the complete set of articulatory commands, defined as all values between -3 and +3 times the standard deviation, that is, covering the whole range of possible values for each parameter. It thus stands for all "possible speech sounds" plotted on a multi-formant (Fi) map. The (F1, F2) plane displays the vocalic triangle, attested by phoneticians, whose corners include the [i a u] vowels. The acoustic framing consists of superimposing an age-specified set of actual data on the MVS of the VLAM at the same age. Hence, it tests whether actual vocalizations belong to this MVS and assesses the acoustic space region(s) explored by 4- and 7-month-old infants.
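The acoustic framing test can be sketched as follows (Python/NumPy). Here vlam_formants is a hypothetical stand-in for the age-dependent articulatory synthesizer, which is not reproduced here, and the bin count is arbitrary.

```python
import numpy as np

def vlam_formants(P, age_months):
    """Toy stand-in for the VLAM synthesizer (not reproduced here):
    maps 7 commands to an (F1, F2) pair in Hz; age_months is unused."""
    f1 = 500.0 + 80.0 * P[0] + 60.0 * P[1]      # arbitrary toy mapping
    f2 = 1500.0 + 300.0 * P[2] - 120.0 * P[1]
    return f1, f2

def maximal_vowel_space(age_months, n=2000, rng=np.random.default_rng(2)):
    """Trace the age-specific MVS by sampling the full command space,
    i.e. all 7 parameters uniformly within +/- 3 standard deviations."""
    P = rng.uniform(-3.0, 3.0, size=(n, 7))
    return np.array([vlam_formants(p, age_months) for p in P])

def acoustic_framing(data_f1f2, mvs_f1f2, n_bins=40):
    """Fraction of observed (F1, F2) points that fall into boxes of the
    plane occupied by the MVS."""
    H, e1, e2 = np.histogram2d(mvs_f1f2[:, 0], mvs_f1f2[:, 1], bins=n_bins)
    i = np.clip(np.digitize(data_f1f2[:, 0], e1) - 1, 0, n_bins - 1)
    j = np.clip(np.digitize(data_f1f2[:, 1], e2) - 1, 0, n_bins - 1)
    return float((H[i, j] > 0).mean())

mvs4 = maximal_vowel_space(4)
print(acoustic_framing(mvs4[:100], mvs4))       # MVS points frame themselves
```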
3.2.2 Results

Each set of actual vocalizations did belong to the corresponding MVS (Figures 6–7). Moreover, the actual data did not entirely cover the space they would have covered had they been the product of mature motor control. More precisely, the 4-month-olds' vocalizations, displayed as black dots superimposed on the MVS in gray in Figure 6, were relatively centered and mid-high: the most fronted, backed and open productions did not seem to be exploited.
Figure 6. Acoustic framing of 4-month-olds' vocalizations (black dots). Gray dots correspond to the 4-month MVS. The Fi are expressed in Hertz.

Figure 7. Acoustic framing of 7-month-olds' vocalizations (black dots). Gray dots correspond to the 7-month MVS. The Fi are expressed in Hertz.
At 7 months (Figure 7), the vocalic productions exploited the high-low dimension more than at the earlier stage.

3.3 Articulatory framing

3.3.1 Method

Certain regions of the MVS, generated by the seven articulatory parameters of the VLAM, were not exploited by the actual data. The articulatory framing made it possible to evaluate infants' motor abilities by constraining the motor variables of the VLAM. In other words, this method aims at assessing the minimal set of articulatory degrees of freedom required to reproduce the observed vocalic sounds. We built several articulatory sub-models with different sets of the VLAM motor parameters. A given sub-model was therefore characterized by its number of articulators and their ranges of variation. A total of 255 sub-models were comparatively assessed with respect to the efficiency with which they reproduced each collection of phonetic data, using their probabilities given the actual vocalizations: P(Mi | f1f2), where Mi denotes the i-th sub-model, characterized by the set of acoustic outputs it generates, while f1f2 stands for the formant values of the actual data. The winner is the sub-model that best fitted a given set of actual data, that is, the one that maximized this conditional probability criterion.

3.3.2 Results

The results for the 4-month-old data indicate that exploration at four months is rather reduced, around the neutral configuration. It involves at least three articulatory parameters, including at least one for the tongue, and the jaw seems to play a minor role in this exploration. The winner sub-model (Figure 8) exploited the lip height (LH), tongue body (TB) and dorsum (TD) degrees of freedom. At seven months, exploration is much larger, and the jaw now plays a dominant role, leading to a large exploration of the open-close contrast and its associated F1 dimension in the formant space (Figure 9). This result agrees with babblers' use of the mandible.
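Assuming a uniform prior over sub-models, maximizing P(Mi | f1f2) amounts to maximizing P(f1f2 | Mi). The sketch below estimates this likelihood by binning the (F1, F2) plane into boxes, as suggested by the grids of Figures 8 and 9; the bin count and smoothing constant are illustrative.

```python
import numpy as np

def log_likelihood(model_f1f2, data_f1f2, n_bins=20, eps=1e-9):
    """log P(data | Mi): bin the (F1, F2) outputs of sub-model Mi into
    boxes, normalize to box probabilities, and score the observed
    vocalizations under them."""
    H, e1, e2 = np.histogram2d(model_f1f2[:, 0], model_f1f2[:, 1],
                               bins=n_bins)
    P = (H + eps) / (H + eps).sum()             # box probabilities for Mi
    i = np.clip(np.digitize(data_f1f2[:, 0], e1) - 1, 0, n_bins - 1)
    j = np.clip(np.digitize(data_f1f2[:, 1], e2) - 1, 0, n_bins - 1)
    return float(np.log(P[i, j]).sum())

# The winner sub-model is the one maximizing this criterion over all
# candidate subsets/ranges of the seven articulatory parameters.
```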
Figure 8. The articulatory framing of the 4-month-olds' vocalizations by the selected three-parameter articulatory sub-model. The black dots correspond to the actual data, the gray ones to the sub-model acoustic outputs. The grid shows the boxes employed to compute the probability criterion. The Fi are expressed in Hertz.
Figure 9. The articulatory framing of the 7-month-olds' vocalizations by the selected four-parameter articulatory sub-model. Same caption as in Figure 8.
3.4 Geometric framing
3.4.1. Method
The articulatory framing made it possible to infer the tongue configurations that could have yielded the acoustic data recorded at 4 and 7 months. The geometric framing is a method of exhaustive inversion: each vocalization corresponds to a set of tract shapes (geometries), produced by the winner sub-model, that yield acoustically plausible products. Two systems were used to describe the vocal tract geometry (see Section 2.2.4): {Xc, Ac, Al} and {Xh, Yh}. Thus, a given vocalic sound could be associated with the mean and variance of these geometric variables over the group of corresponding tract shapes.
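A sketch of this exhaustive inversion step (again illustrative: the tolerance radius in the formant plane and the variable names are our own assumptions):

import numpy as np

def invert_vocalization(target_f1f2, sim_formants, sim_geometry, radius):
    """Exhaustive inversion: collect every tract shape of the winner
    sub-model whose simulated (F1, F2) falls within `radius` of the
    target vocalization, then summarize its geometric variables
    (e.g. Xh, Yh or Xc, Ac, Al) by their mean and variance."""
    dist = np.linalg.norm(sim_formants - target_f1f2, axis=1)
    shapes = sim_geometry[dist < radius]   # acoustically plausible shapes
    return shapes.mean(axis=0), shapes.var(axis=0)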
As compensation leads to rather high variances, especially for central vocalizations, for clarity's sake we only displayed the dispersion ellipses of 4 "prototypes" added to each set of real data: [i a u] were chosen at roughly the same positions as the adult's in the MVS, whereas [ə] was produced with all commands set to 0. Thus, [i a u ə] served as landmarks.

3.4.2. Results
At 4 months (Figure 10), the average tract shapes (plotted as gray stars in the figure) had the tongue's highest points rather centered and clustered (around [ə]). The constrictions were slightly fronted and fairly wide.
Figure 10. The geometric framing of the pre-babblers' vocalizations with the age-matched winner (4 months old). Panel A: acoustic space; Panel B: the tongue's highest point; Panel C: Ac versus Xc; Panel D: Al versus Ac (the Xc axis runs from the lips to the glottis). In the acoustic domain (Panel A), the yellow stars correspond to the actual acoustic data, the mauve ones stand for the sub-model acoustic simulations, from which, around each actual vocalization, a group of sounds was selected to perform the exhaustive inversion. Each group is color-coded along the F2 axis (from cold to warm colors) so as to track the means of the geometric characteristics of the resulting shapes in the {Xh, Yh} space (Panel B) and the {Xc, Ac, Al} space (Panels C and D). The points labeled "i a u ə" correspond to "prototypic" formant values of the adult-like vowels (Panel A) that have been exhaustively inverted using the age-matched VLAM. For clarity's sake, only the dispersion ellipses of the geometric characteristics of their inferred average shapes are plotted.
Figure 11. The geometric framing of the babblers' (7-month-olds') vocalizations with the age-matched winner. Same caption and panel layout as in Figure 10.
At 7 months (Figure 11), the tongue positions showed a larger exploration of the high-low and front-back dimensions than at the earlier stage. Moreover, we found that, before canonical babbling (4 months), all the articulatory configurations leading to first two formant frequencies falling within the [u] region had palatal constrictions. This result is of interest with regard to how the developmental path followed by articulatory exploration may shape adult speech. Indeed, although three types of tract constriction (palatal, velo-pharyngeal and pharyngeal) should be able to produce the vowel [u] with identical first three formants (Boë et al., 2000), the only one found in the native (adult) speakers of all the languages tested is the palatal one (Wood, 1979). The velo-pharyngeal [u] is seldom observed in perturbation experiments (lip-tube, Savariaux et al., 1995), while the pharyngeal one has never been recorded. According to Abry and colleagues (1996), the palatal [u] would be the first [u] production strategy picked during speech development: its dominant position in adulthood would stem from its early sensorimotor mapping. This hypothesis is in agreement with the palatal nature of the productions in the acoustic region around [u] in the simulations at four months.
3.5 Conclusion
The results of the simulation of vocal exploration in infants indicate that speech development does not begin with an exhaustive exploration of the vocal tract's potential. We may suggest that "explore all possible speech sounds, then select what is needed to communicate" would be a much more time- and energy-consuming strategy than, for instance, "explore, according to currently available abilities, and try to produce what is perceived in the ambient language, just to develop the motor skills needed". The second strategy should provide a higher adaptive value than the first one, as more resources would be left for the development of other biological functions. From an evolutionary point of view, this would account for the counter-selection of the first strategy. To sum up, before canonical babbling, infants would use the lower lip height (LH), tongue body (TB) and tongue dorsum (TD) commands, which is consistent with newborn imitation studies. Furthermore, the importance of TD is in agreement with its likely role in suckling. The jaw articulator (J) would play only a minor role at this stage, and would become significant in the canonical babbling data.
4. Simulating early vocal imitation
In this section, we tried to simulate Kuhl and Meltzoff's experiment on early vocal imitation (Kuhl & Meltzoff, 1996), a behavior which appears before canonical babbling at the latest. The purpose was to gain some insight into the cognitive representations that might be involved in early vocal imitation, and to test whether and how the robot is able to reproduce at least the actual infants' imitation performance. First, an overview of Kuhl and Meltzoff's experiment will be given, together with a description of how the problem was translated into Bayesian terms. Then, the implementation of imitation and the corresponding results will be presented.

4.1 An overview of Kuhl and Meltzoff's experiment on early vocal imitation
72 subjects aged 12 to 20 weeks were exposed to audiovisual adult face-voice stimuli corresponding to the vowels [i], [a] and [u]. Only 45 of them happened to produce vowel-like utterances during the experiment. Their subsequent vowel-like productions were described phonetically and acoustically. The transcription system was that of the set of English vowels, but the transcribed items were merged into three main classes: the /a/-like, grouping low vowels; the /i/-like, grouping front vowels; and the /u/-like, grouping back vowels. Table 1 provides the resulting confusion matrix, that is, the number of "i-like", "a-like" and "u-like" vocalizations (according to this classification) for each of the three possible adult targets [i a u].
Table 1. The confusion matrix of early vocal imitation reported in Kuhl and Meltzoff (1996). Each cell provides the number of "i-like", "a-like" and "u-like" vocalizations (see text) for each of the three possible adult targets [i a u]. Among the 72 infants in the experiment, only 45 produced vowel-like utterances. Altogether these 45 infants uttered 224 vowel-like vocalizations over the course of the experiment.

              Target [i]   Target [a]   Target [u]   Total
  i-like          22           11            4         37
  a-like          25           66           14        105
  u-like          20           18           44         82
  Total           67           95           62        224
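The reported global scores can be recomputed directly from Table 1; a small illustrative check (the congruent-response rate is the sum of the diagonal cells over the grand total):

import numpy as np

# Rows: i-like, a-like, u-like responses; columns: adult targets [i a u].
confusions = np.array([[22, 11,  4],
                       [25, 66, 14],
                       [20, 18, 44]])

percent_cr = 100 * np.trace(confusions) / confusions.sum()   # ~58.9% congruent
per_class = 100 * confusions.sum(axis=1) / confusions.sum()  # ~16.5/46.9/36.6%
print(percent_cr, per_class)                                 # /i/-, /a/-, /u/-like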
In sum, the pre-babblers produced vocalic sounds significantly more often categorized as similar to the "target" vowel after they had been exposed to that stimulus than otherwise. Globally, the subjects produced around 59% congruent responses (hereafter %CR), that is, responses consistent with an imitative behavior. Further, about 16.5%, 47% and 36.5% of their vocalizations sounded /i/-, /a/- and /u/-like, respectively.

4.2 Specifying the model
In the Bayesian robotics framework, the robot learns a sensori-motor map of its vocal tract behavior, corresponding to a probabilistic description of the observable links between its perceptual and its articulatory variables. Imitation then corresponds to inversion, that is, the conversion of a sensory state into a motor counterpart. The motor parameters were selected according to the results of the articulatory framing at 4 months (Section 3.3), i.e. the lower lip height (LH), tongue body (TB) and tongue dorsum (TD) commands, while the auditory variables (Section 2.2.1) were the first two formant frequencies (F1, F2) expressed in Bark. The formants of a vocalic sound are a function of the tract shape, whose mid-sagittal section can be described by three variables: the inter-lip area (Al) and the coordinates (Xh, Yh) of the tongue's highest point in a fixed system of reference. As mentioned in Section 2.2.3, Xh and Yh are potential outputs of the somesthetic system, and Al can be either a somatosensory or a visual variable (depending on whether this piece of information comes from the self or from the other). All the model variables were assumed to be discrete, each varying over a set of mutually exclusive values. The core of a Bayesian robot is the set of statistical relationships that define the links between variables. {Xh, Yh, Al} were used as pivots of the joint probability decomposition. Indeed, since they provide an intermediate space between
the auditory space and the articulatory space, they help reduce the impact of the many-to-one problem on inversion (Boë et al., 1992) when they function as independent variables in the joint probability decomposition. Further assumptions then lead to the following probabilistic structure:

P(LH⊗TB⊗TD⊗Xh⊗Yh⊗Al⊗F1⊗F2)
  = P(Xh) * P(Yh) * P(Al)
  * P(LH/Al) * P(TB/Xh⊗Yh) * P(TD/Xh⊗Yh⊗TB)
  * P(F1/Xh⊗Yh⊗Al) * P(F2/Xh⊗Yh⊗Al)          (1)

This equation specifies the decomposition of the joint probability distribution linking all articulatory (LH, TB, TD), intermediate (Xh, Yh, Al) and auditory (F1, F2) variables (first line of Eq. 1). The first three factors (second line) indicate that (Xh, Yh, Al) are taken as the primary variables, assumed to be independent. The next three factors (third line) give the minimal set of links between intermediate and articulatory variables: Al specifies the lips (LH), while (Xh, Yh) specify the tongue (TB, TD). The last two factors (fourth line) express the links between intermediate and auditory variables, assumed to be independent of one another. In this equation, the independent variables Xh, Yh and Al were associated with uniform distributions, while all the other factors were conditional probabilities assumed to follow Gaussian laws, whose means and variances had to be tuned in the learning phase.

4.3 Learning the model
To become an actual (and useful) description of the robot's sensori-motor behavior, the distributions composing this probabilistic structure need to be learned from a set of experimental data, which corresponds here to a random exploration of the articulatori-geometrico-acoustic skills of the 4-month robot specified in Section 3.3.2 (R4m in the following). The robot's "proficiency" in inversion, that is, in exploiting its map via Bayesian inference to draw motor values likely to make it reach a given target state of its perceptual variables, mainly depends on the size of the learning database (DBS) and on the degree of discretization of the geometric parameters (GDD). Indeed, as Xh, Yh and Al are the pivots of the description, the GDD partly determines the accuracy of the distributions the robot learns: it gives the minimal gap required to distinguish two items in the geometric domain, and it sets the adequate size of the learning space, that is, the number of articulatory and auditory distributions that have to be learned for the description to represent the whole range of the R4m abilities. There is thus a trade-off between the GDD and the DBS, because a given geometric box must include enough configurations for the matched motor and auditory distributions to be learned.
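What learning this description could look like in code (a sketch under our own assumptions: the geometric variables are discretized into boxes, and the Gaussian conditionals of Eq. 1 are estimated per box from the exploration data; P(Xh), P(Yh) and P(Al) remain uniform):

import numpy as np
from collections import defaultdict

def learn_map(samples, n_boxes):
    """samples: rows [LH, TB, TD, Xh, Yh, Al, F1, F2] from a random
    exploration of the robot's abilities. For each discretized
    (Xh, Yh, Al) box, estimate the means and variances of the motor
    (LH, TB, TD) and auditory (F1, F2) variables."""
    geo = samples[:, 3:6]
    lo, hi = geo.min(axis=0), geo.max(axis=0)
    idx = ((geo - lo) / (hi - lo + 1e-9) * np.array(n_boxes)).astype(int)
    idx = np.minimum(idx, np.array(n_boxes) - 1)
    per_box = defaultdict(list)
    for box, row in zip(map(tuple, idx), samples):
        per_box[box].append(row[[0, 1, 2, 6, 7]])   # keep LH, TB, TD, F1, F2
    return {box: (np.mean(rows, axis=0), np.var(rows, axis=0))
            for box, rows in per_box.items()}

# e.g. a GDD of {8, 8, 4}: sensorimotor_map = learn_map(data, (8, 8, 4))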
Figure 12. Assessing the GDD/DBS trade-off: mean formant error at the output of the inversion process (in Bark), as a function of the DBS, with the GDD ({16,16,8}, {8,8,4}, {4,4,2} or {2,2,1}) as a parameter.
In order to evaluate which description could best account for the performance reported in Kuhl and Meltzoff (1996), 4 GDDs × 15 DBSs were tested. The DBS ranged from 1 to 60,000 items. The GDDs were {16, 16, 8}, {8, 8, 4}, {4, 4, 2} and {2, 2, 1} for the number of {Xh, Yh, Al} classes, which yielded 2048, 256, 32 and 4 "boxes" in the geometric space, respectively. In a first step, the GDD/DBS trade-off was studied through the ability of the model to perform inversion of vocalizations in its exploration domain. Figure 12 illustrates the results for the auditory inversion of 1000 items randomly chosen out of the R4m abilities. At the maximal DBS, that is, for the largest amount of learning, the error decreases as the GDD increases, and reaches values around 0.2 Bark (roughly the formant just-noticeable difference) for the highest two GDD values. Moreover, for a given GDD, the error tends to decrease as the DBS rises, down to a limit which is the lowest error this GDD allows the robot to reach. However, all the GDDs but the roughest provide unstable scores as long as the DBS is below a certain value. This is due to the fact that not all geometric boxes are actually learned (under-learning phase). Indeed, the smallest DBS required to bring the error within 10% of the GDD-matched lowest error was found to be three times the size of the geometric space (in boxes). In other words, the more boxes in the geometric space (the larger the GDD is), the more precise its variables are, but the larger the DBS must be for the robot's map to be representative of its sensorimotor skills (at least three times larger than the GDD).
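The evaluation protocol behind Figure 12 could be sketched as follows (illustrative only: `invert` and `produce` stand for the model's Bayesian inversion and the articulatory synthesis of the VLAM, which are assumed rather than implemented here):

import numpy as np
from itertools import product

GDDS = [(16, 16, 8), (8, 8, 4), (4, 4, 2), (2, 2, 1)]
DBSS = np.unique(np.logspace(0, np.log10(60_000), 15).astype(int))

def mean_inversion_error(invert, produce, targets):
    """Mean distance, in Bark, between each auditory (F1, F2) target and
    the formants produced from the motor configuration returned by the
    inversion of that target (1000 targets drawn from the R4m abilities)."""
    return np.mean([np.linalg.norm(produce(invert(t)) - t) for t in targets])

# Sweeping product(GDDS, DBSS), relearning the map at each setting and
# scoring it with mean_inversion_error, reproduces the 4 x 15 grid of Fig. 12.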
4.4 Implementing auditory and audio-visual imitation
Once a model, defined by a given GDD, has been learned on a given DBS, it can be submitted to imitation tests. Since the experimental data were obtained in an audio-visual configuration, we submitted the robot to two imitation tasks, audio-only and audio-visual imitation, in order to assess the role of multimodality in this framework. In auditory (hereafter A) imitation, the perceptual target was the (F1, F2) pair of a vowel, while in audiovisual (AV) imitation it was the (F1, F2, Al) triplet. Two target sets were the focus of the imitation experiments, namely "external" and "internal" [i a u] items. The former corresponded to the [i a u] of the 4-month-old VLAM, the latter to their closest simulations within the R4m capacity. This means that both target sets were adapted to the 4-month articulatory-acoustic space ("normalized" targets), but the first one consisted of [i], [a] or [u] targets outside the true vocalization space at 4 months, while the second one consisted of the three corners of this space. For each target, 300 motor configurations were drawn from the P(LH⊗TB⊗TD/PerceptualTarget) distribution. The formants produced by each articulatory pattern were computed, and the sound was categorized as [i], [a] or [u] according to its nearest target in the (F1, F2) plane, in terms of Euclidean distance. This allowed us to compute congruent imitation scores %CR for A and AV imitation, for both external and internal targets, and for various values of the GDD and the DBS.
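The scoring procedure could be rendered as follows (a sketch: `draw_motor` stands for a draw from P(LH⊗TB⊗TD/PerceptualTarget) and `produce` for articulatory synthesis, both assumed):

import numpy as np

def imitation_score(draw_motor, produce, targets, n_draws=300):
    """%CR over the three targets: for each perceptual target, draw motor
    configurations, synthesize their (F1, F2), and count a response as
    congruent when its nearest target (Euclidean distance in the formant
    plane) is the one that was presented."""
    f_targets = np.array([t[:2] for t in targets])   # (F1, F2) part only
    congruent = 0
    for k, target in enumerate(targets):
        for _ in range(n_draws):
            f1f2 = produce(draw_motor(target))
            congruent += np.argmin(np.linalg.norm(f_targets - f1f2,
                                                  axis=1)) == k
    return 100 * congruent / (n_draws * len(targets))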
4.5 A and AV imitation results
The congruent response scores %CR, as functions of the GDD and the DBS, for the AV inversion of the internal and external [i a u] targets are displayed in Figures 13 and 14, respectively. The A inversion scores, not displayed here, are systematically slightly lower. Furthermore, the following trends appear.

GDD/DBS trade-off and under-learning
Of course, the same GDD/DBS trade-off as in Figure 12 is found in all cases. Under-learning happens when the imitation scores are lower than their asymptote for a given GDD (the DBS not being large enough for this GDD), and results in a rather erratic behavior of the %CR scores. Globally, under-learning is greater for external than for internal targets, and ends more quickly for AV than for A imitation.
External vs. internal targets: The risk of over-learning
The scores for external targets are lower than for internal ones, which is quite understandable, considering that the former are outside the R4m vocalization space while the latter are not. More surprisingly, in the A case the imitation score never reaches 100% with external targets, even with the highest GDD and DBS configurations, that is, 2048 geometric boxes and 60,000 items in the learning set. This is ascribable to the over-learning problem. Indeed, when the description is completely
Figure 13. %CR for the AV inversion of the "internal" [i a u] vowels, as a function of the DBS, with the GDD ({16,16,8}, {8,8,4}, {4,4,2} or {2,2,1}) as a parameter. "Infants" stands for the score obtained by the 12–20-week-old infants in the study by Kuhl and Meltzoff (1996).
Figure 14. %CR for the AV inversion of the "external" [i a u] vowels, as a function of the DBS, with the GDD as a parameter. Same legend as in Figure 13.
representative of the robot's sensori-motor abilities (e.g. with a maximal DBS), all the distributions of the motor variables have small variances, that is, they are very accurate. However, none of them matches the target the robot tries to imitate. Hence, the system draws articulatory configurations even though they are irrelevant given the target sound. In other words, the GDD has to contain a small number of large boxes for the robot to be able to imitate vocalic sounds that are outside its sensori-motor abilities. The problem is overcome if visual information is also provided: since the VLAM [i a u] inter-lip areas belong to the R4m ones, the robot can select configurations that produce sounds close to the target.
Early vocal imitation does not need much learning
Altogether, it is striking that the robot needs neither a high GDD nor a large DBS in order to perform as well as, or even better than, the actual infants. For example, in the case of external targets, which are out of its motor abilities (and which correspond more closely to the experimental conditions of the imitation data in the Kuhl & Meltzoff study), it gets 60% CR (as the infants did) or more with DBSs of 50 and 25 items and a GDD of 32 boxes, in A and AV inversions, respectively.

4.6 Conclusion
The major lesson of this second study is that a very small number of vocalizations (fewer than a hundred) is enough for a robotic learning process to provide imitation scores at least as high as those of 20-week-old infants. This is due to the fact that the imitation task studied by Kuhl & Meltzoff is basically a three-category problem, which can be described rather simply and roughly in articulatory-acoustic terms, hence the success of the present robotic experiment. This shows that the problem is actually not so much learning as control, that is, managing to produce a desired articulatory configuration, which the infant is not yet able to do easily at four months. The A and AV imitation experiments displayed a trade-off between the somesthetic acuity of the tract shape representation (GDD) and the amount of information (DBS) to be learned in order to build a sensori-motor map that is representative enough of the robot's skills. Further, our results show that the GDD has to be rough for the robot to be able to imitate vocalic sounds that are outside its articulatori-acoustic abilities. This is interesting since, in fact, infants must acquire, by imitation, the speech sounds of their ambient languages although they are not endowed from birth with the matched motor capacity. Moreover, this investigation supports the view that the formation of the cognitive representation likely to underlie early vocal imitation would require less learning with audiovisual speech perception than without vision. This gives some evidence that vision can facilitate phonetic development, and is congruent with the slight differences in speech development between seeing and visually impaired children (Mills, 1987). Altogether, this preliminary work confirms that infants complement their very early visuo-facial imitation abilities with auditory-to-articulatory relationships, and shows that a very small amount of data is enough to produce realistic imitation scores, provided the discretization is rough enough.
5. Perspectives in the study of ontogeny and phylogeny
The experiments described here anchor both the articulatory and the perceptual representations of the baby robot in actual infants' perceptuo-motor ground. The continuation of this work will consist in allowing the robot to grow up, mimicking as closely as possible the developmental process at work in human speech acquisition. This involves the various steps described in Section 1.3, and particularly the acquisition of frame and content control in the production of syllables (Davis & MacNeilage, 1995; MacNeilage, 1998). Throughout this process, an important output of the work will be information about the perceptual and motor representations acquired by the system at the various developmental stages. In a way, it should provide a window on the representations of speech in the baby's and child's brain, which cannot be directly observed by simple means.

Another challenge will be to study how speech as a linguistic system may be patterned by both perceptual and motor constraints. This route towards a "substance-based" approach to phonology, which simulates speech phylogeny, is not new. One of its precursors is found in the Adaptive Variability Theory of Lindblom and colleagues, with a number of important results on the prediction of vowel systems (see e.g. Liljencrants & Lindblom, 1972; Lindblom, 1986, 1990; and the extension we proposed through the "Dispersion-Focalization Theory", Schwartz et al., 1997) and of consonant systems (e.g. Lindblom, 1997; Boë et al., 2000; Abry, 2003). More recently, Steels and others introduced the concept of speech games in societies of talking agents (e.g. Steels, 1998; Berrah et al., 1996; de Boer, 2000). The definition of more realistic agents, able to act, perceive and learn in a biologically, developmentally and cognitively plausible way, is crucial there. Integrating perception and action within a coherent computational framework is a natural way to better understand how speech representations are acquired, how perception controls action and how action constrains perception. It also provides a natural framework for integrating various sources of knowledge about the speech process, including behavioral and developmental data, neurophysiological and neuropsychological facts about the neural circuits of perception, action and language, and linguistic knowledge about phonology or syntax, and for attempting to draw links between these complex ingredients in order to begin to write the story of the emergence of human language. We believe that modeling speech communication in a robotic framework should contribute to a computational approach that is relevant for future progress in the study of speech and language ontogeny and phylogeny.
Acknowledgements
This work was prepared with support from the European ESF Eurocores program OMLL, from the French funding programs CNRS STIC Robea and CNRS SHS OHLL, and from the MESR ACI Neurosciences Fonctionnelles. It benefited from discussions with and suggestions from Louis-Jean Boë, Barbara Davis, Chris Matyear, Emmanuel Mazer and Christian Abry.
References
Abry, C. (2003). [b]-[d]-[g] as a universal triangle as acoustically optimal as [i]-[a]-[u]. Proc. 15th International Congress of Phonetic Sciences (ICPhS), 727–730.
Abry, C., Badin, P., et al. (1996). Speech Mapping as a framework for an integrated approach to the sensori-motor foundations of language. 4th Speech Production Seminar, 1st ESCA Tutorial and Research Workshop on Speech Production Modelling: From control strategies to acoustics (pp. 175–184), May 21–24, 1996, Autrans, France.
Abry, C., Benoît, C., Boë, L.-J., & Sock, R. (1985). Un choix d'événements pour l'organisation temporelle du signal de parole. 14èmes Journées d'Études sur la Parole, Société Française d'Acoustique, 133–137.
Abry, C., & Boë, L.-J. (1986). Laws for lips. Speech Communication, 5, 97–104.
Abry, C., Orliaguet, J. P., & Sock, R. (1990). Patterns of speech phasing. Their robustness in the production of a timed linguistic task: Single vs. double (abutted) consonants in French. European Bulletin of Cognitive Psychology, 10, 269–288.
Bailly, G. (1997). Learning to speak. Sensori-motor control of speech movements. Speech Communication, 22, 251–268.
Berrah, A. R., Glotin, H., Laboissière, R., Bessière, P., & Boë, L.-J. (1996). From form to formation of phonetic structures: An evolutionary computing perspective. In T. Fogarty & G. Venturini (Eds.), ICML'96 Workshop on Evolutionary Computing and Machine Learning (pp. 23–29). Bari, Italy.
Bessière, P. (2000). Vers une théorie probabiliste des systèmes sensori-moteurs. HDR, Université Joseph Fourier, Grenoble, France.
Bessière, P., Dedieu, E., Lebeltel, O., Mazer, E., & Mekhnacha, K. (1998a). Interprétation ou description (I) : Proposition pour une théorie probabiliste des systèmes cognitifs sensori-moteurs. Intellectica, 26–27, 257–311.
Bessière, P., Dedieu, E., Lebeltel, O., Mazer, E., & Mekhnacha, K. (1998b). Interprétation ou description (II) : Fondements mathématiques de l'approche F+D. Intellectica, 26–27, 313–336.
Bladon, A. (1982). Arguments against formants in the auditory representation of speech. In R. Carlson & B. Granström (Eds.), The representation of speech in the peripheral auditory system (pp. 95–102). Amsterdam: Elsevier Biomedical.
Boë, L.-J. (1999). Modelling the growth of the vocal tract vowel spaces of newly-born infants and adults. Proc. XIVth International Congress of Phonetic Sciences, San Francisco, USA, 2501–2504.
Boë, L.-J., Abry, C., Beautemps, D., Schwartz, J.-L., & Laboissière, R. (2000). Les sosies vocaliques – Inversion et focalisation. XXIIIèmes Journées d'Étude sur la Parole, Aussois, 257–260.
Boë, L.-J., Gabioud, B., & Perrier, P. (1995a). Speech Maps Interactive Plant « SMIP ». Proc. XIIIth International Congress of Phonetic Sciences, vol. 2 (pp. 426–429), Stockholm, Sweden.
Boë, L.-J., Gabioud, B., Perrier, P., Schwartz, J.-L., & Vallée, N. (1995b). Vers une unification des espaces vocaliques. In C. Sorin et al. (Eds.), Levels in speech communication: Relations and interactions (pp. 63–71). Amsterdam: Elsevier Science.
Boë, L.-J., & Maeda, S. (1998). Modélisation de la croissance du conduit vocal. Journées d'Études Linguistiques "La Voyelle dans tous ses états" (pp. 98–105). Nantes.
Boë, L.-J., Perrier, P., & Bailly, G. (1992). The geometric vocal tract variables controlled for vowel production: Proposals for constraining acoustic-to-articulatory inversion. Journal of Phonetics, 20, 27–38.
Boë, L.-J., Perrier, P., Guérin, B., & Schwartz, J.-L. (1989). Maximal vowel space. Proc. of Eurospeech 89, 281–284.
Boë, L.-J., Vallée, N., Badin, P., Schwartz, J.-L., & Abry, C. (2000). Tendencies in phonological structures: The influence of substance on form. Current Trends in Phonology and Phonetics II: Relationship between phonetics and phonology. Les Cahiers de l'ICP, Bulletin de la Communication Parlée, 5, 35–55.
Bosma, J. F. (Ed.) (1967). Symposium on oral sensation and perception. Springfield, IL: Charles C. Thomas.
Bothorel, A., Simon, P., Wioland, F., & Zerling, J.-P. (1986). Cinéradiographie des voyelles et des consonnes du français. Recueil de documents synchronisés pour quatre sujets : vues latérales du conduit vocal, vues frontales de l'orifice labial, données acoustiques. Strasbourg: Institut de Phonétique.
Brooks, R. A., Breazeal, C., Marjanovic, M., Scassellati, B., & Williamson, M. (1999). The Cog Project: Building a humanoid robot. In C. Nehaniv (Ed.), Computation for metaphors, analogy, and agents [Lecture Notes in Artificial Intelligence 1562] (pp. 52–87). New York: Springer.
Campbell, R., Dodd, B., & Burnham, D. (Eds.) (1998). Hearing by eye, II. Perspectives and directions in research on audiovisual aspects of language processing. Hove: Psychology Press.
Chistovich, L. A. (1976). Physiology of speech: Human speech perception. Leningrad: Nauka (in Russian).
Chistovich, L. A. (1980). Auditory processing of speech. Language and Speech, 23, 67–72.
Davis, B., & MacNeilage, P. F. (1995). The articulatory basis of babbling. Journal of Speech and Hearing Research, 38, 1199–1211.
de Boer, B. G. (2000). Self-organisation in vowel systems. Journal of Phonetics, 28, 441–465.
Delgutte, B. (1984). Speech coding in the auditory nerve II: Processing schemes for vowel-like sounds. J. Acoust. Soc. Am., 75, 879–886.
Dodd, B., & Campbell, R. (Eds.) (1987). Hearing by eye: The psychology of lipreading. Mahwah, NJ: Lawrence Erlbaum Associates.
Fant, G. (1960). Acoustic theory of speech production. The Hague: Mouton.
Gabioud, B. (1994). Articulatory models in speech synthesis. In E. Keller (Ed.), Fundamentals of speech synthesis and recognition. Basic concepts, state-of-the-art and future challenges (pp. 215–230). Chichester: John Wiley.
Goldstein, U. G. (1980). An articulatory model for the vocal tract of the growing children. Thesis of Doctor of Science, MIT, Cambridge, Massachusetts, USA.
Guenther, F. H. (1995). Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production. Psychological Review, 102, 594–621.
Guiard-Marigny, T. (1992). Modélisation des lèvres. DEA Signal Image Parole, INP, Grenoble, France.
Hardcastle, W. J. (1976). Physiology of speech production. London: Academic Press.
Hoole, P. (1987). Bite-block speech in the absence of oral sensibility. Proc. ICPhS, Tallinn, 4, 16–19.
Jakobson, R. (1968). Child language, aphasia, and phonological universals. The Hague: Mouton.
Jordan, M. (1998). Learning in graphical models. Cambridge, MA: MIT Press.
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1996). Partially observable Markov decision processes for artificial intelligence. In Reasoning with uncertainty in robotics, Proc. International Workshop RUR'95 (pp. 146–162). Springer-Verlag.
Kent, R. D., Martin, R. E., & Sufit, R. L. (1990). Oral sensation: A review and clinical prospective. In H. Winitz (Ed.), Human communication and its disorders (pp. 135–191). Norwood, NJ: Ablex Publishing.
Kent, R. D., & Miolo, G. (1995). Phonetic abilities in the first year of life. In P. Fletcher & B. MacWhinney (Eds.), The handbook of child language. Oxford: Blackwell.
Koopmans-Van Beinum, F., & Van Der Stelt, J. (1986). Early stages in the development of speech movements. In B. Lindblom & R. Zetterstrom (Eds.), Precursors of early speech (pp. 37–49). New York: Stockton Press.
Kuhl, P., & Meltzoff, A. N. (1982). The bimodal perception of speech in infancy. Science, 218, 1138–1141.
Kuhl, P., & Meltzoff, A. N. (1996). Infant vocalizations in response to speech: Vocal imitation and developmental changes. J. Acoust. Soc. Am., 100, 2425–2438.
Laboissière, R. (1992). Préliminaires pour une robotique de la communication parlée : inversion et contrôle d'un modèle articulatoire du conduit vocal. Thèse de Docteur de l'INPG, Signal-Image-Parole, Grenoble, France.
Landgren, S., & Olsson, K. A. (1982). Oral mechanoreceptors. In S. Grillner (Ed.), Speech motor control. Oxford: Pergamon.
Lauritzen, S. L. (1996). Graphical models. Oxford: Oxford University Press.
Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.
Lebeltel, O., Bessière, P., Diard, J., & Mazer, E. (2004). Bayesian robot programming. Autonomous Robots, 16, 49–79.
Liljencrants, J., & Lindblom, B. (1972). Numerical simulations of vowel quality systems: The role of perceptual contrast. Language, 48, 839–862.
Lindblom, B. (1986). Phonetic universals in vowel systems. In J. J. Ohala & J. J. Jaeger (Eds.), Experimental phonology (pp. 13–44). New York: Academic Press.
Lindblom, B. (1990). On the notion of possible speech sound. Journal of Phonetics, 18, 135–152.
Lindblom, B. (1997). Systemic constraints and adaptive change in the formation of sound structure. In J. Hurford (Ed.), Evolution of human language. Edinburgh: Edinburgh Univ. Press.
Lindblom, B., Lubker, J., & McAllister, R. (1977). Compensatory articulation and the modeling of normal speech production behavior. In R. Carré et al. (Eds.), Articulatory modeling and phonetics (pp. 147–161). GALF.
Mackenzie Beck, J. (1997). Organic variation of the vocal apparatus. In W. J. Hardcastle & J. Laver (Eds.), The handbook of phonetic sciences (pp. 256–297). London: Blackwell Publishers.
MacNeilage, P. F. (1998). The frame/content theory of evolution of speech production. Behavioral and Brain Sciences, 21(4), 499–511.
MacNeilage, P. F., & Davis, B. (1990). Acquisition of speech production, frames then content. In M. Jeannerod (Ed.), Attention and performance, XIII: Motor representation and control (pp. 453–476). Hillsdale, NJ: Lawrence Erlbaum Associates.
MacNeilage, P. F., Rootes, T. P., & Chase, R. A. (1967). Speech production and perception in a patient with severe impairment of somesthesic perception and motor control. Journal of Speech and Hearing Research, 10, 449–467.
Maeda, S. (1989). Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model. In W. J. Hardcastle & A. Marchal (Eds.), Speech production and modelling (pp. 131–149). Dordrecht: Kluwer.
Matyear, C. L. (1997). An acoustical study of vowels in babbling. Doct. diss., University of Texas, Austin (unpublished).
Matyear, C. L., MacNeilage, P. F., & Davis, B. L. (1998). Nasalization of vowels in nasal environments in babbling: Evidence for frame dominance. Phonetica, 55, 1–17.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Meltzoff, A. N. (2000). Newborn imitation. In D. Muir & A. Slater (Eds.), Infant development: The essential readings (pp. 165–181). Oxford: Blackwell.
Ménard, L., Schwartz, J.-L., Boë, L.-J., Kandel, S., & Vallée, N. (2002). Auditory normalization of French vowels synthesized by an articulatory model simulating growth from birth to adulthood. Journal of the Acoustical Society of America, 111(4), 1892–1905.
Ménard, L., Schwartz, J.-L., & Boë, L.-J. (2004). The role of vocal tract morphology in speech development: Perceptual targets and sensori-motor maps for French synthesized vowels from birth to adulthood. Journal of Speech, Language, and Hearing Research, 47, 1059–1080.
Mills, A. E. (1987). The development of phonology in the blind child. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lipreading (pp. 145–161). Mahwah, NJ: Lawrence Erlbaum Associates.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo: Morgan Kaufmann Publishers.
Piquemal, M., Schwartz, J.-L., Berthommier, F., Lallouache, T., & Escudier, P. (1996). Détection et localisation auditive d'explosions consonantiques dans des séquences VCV bruitées. Actes des XXIèmes Journées d'Études sur la Parole, SFA, 143–146.
Pols, L. C. W. (1975). Analysis and synthesis of speech using a broad-band spectral representation. In G. Fant & M. A. A. Tatham (Eds.), Auditory analysis and perception of speech (pp. 23–36). London: Academic.
Recasens, D. (1991). An electropalatographic and acoustic study of consonant-to-vowel coarticulation. Journal of Phonetics, 19, 177–192.
Savariaux, C., Perrier, P., & Orliaguet, J. P. (1995). Compensation strategies for the perturbation of the rounded vowel [u] using a lip-tube: A study of the control space in speech production. J. Acoust. Soc. Am., 98, 2428–2442.
Schroeder, M. R., Atal, B. S., & Hall, J. L. (1979). Objective measure of certain speech signal degradations based on masking properties of human auditory perception. In B. Lindblom & S. Öhman (Eds.), Frontiers of speech communication research (pp. 217–229). London: Academic Press.
Schwartz, J.-L., Abry, C., Boë, L.-J., & Cathiard, M. (2002). Phonology in a theory of perception-for-action-control. In J. Durand & B. Laks (Eds.), Phonetics, phonology and cognition (pp. 255–280). Oxford: Oxford University Press.
Schwartz, J.-L., Arrouas, Y., Beautemps, D., & Escudier, P. (1992). Auditory analysis of speech gestures. In M. E. H. Schouten (Ed.), The auditory processing of speech: From sounds to words (pp. 239–252). [Speech Research, 10.] Berlin: Mouton de Gruyter.
Schwartz, J.-L., & Boë, L.-J. (2000). Predicting palatal contacts from jaw and tongue commands: A new sensory model and its potential use in speech control. 5th Seminar on Speech Production: Models and Data.
Schwartz, J.-L., Boë, L.-J., Vallée, N., & Abry, C. (1997). The dispersion-focalization theory of vowel systems. Journal of Phonetics, 25, 255–286.
Schwartz, J.-L., Robert-Ribes, J., & Escudier, P. (1998). Ten years after Summerfield… A taxonomy of models for audiovisual fusion in speech perception. In R. Campbell, B. Dodd, & D. Burnham (Eds.), Hearing by eye, II. Perspectives and directions in research on audiovisual aspects of language processing (pp. 85–108). Hove: Psychology Press.
Serkhane, J. E., & Schwartz, J.-L. (2003). Simulating vocal imitation in infants, using a growth articulatory model and speech robotics. Proc. ICPhS, Barcelona, 2241–2245.
Serkhane, J., Schwartz, J.-L., Boë, L.-J., Davis, B., & Matyear, C. (2002). Motor specifications of a baby robot via the analysis of infants' vocalizations. ICSLP'2002, Denver, Colorado, 45–48.
Steels, L. (1998). Synthesising the origins of language and meaning using co-evolution, self-organisation and level formation. In J. R. Hurford, M. Studdert-Kennedy, & C. Knight (Eds.), Approaches to the evolution of language (pp. 384–404). Cambridge: Cambridge University Press.
Thrun, S. (1998). Bayesian landmark learning for mobile robot localization. Machine Learning, 33, 41–76.
Vilain, A., Abry, C., & Badin, P. (2000). Coproduction strategies in French VCVC: Confronting Öhman's model with adult and developmental articulatory data. Proc. 5th Seminar on Speech Production, Munich, Germany, 81–84.
Wood, S. (1979). A radiographic analysis of constriction locations for vowels. Journal of Phonetics, 7, 25–43.
Wu, Z. L., Schwartz, J.-L., & Escudier, P. (1996). Physiologically plausible modules for the detection of articulatory-acoustic events. In B. Ainsworth (Ed.), Advances in speech, hearing and language processing, Vol. 3: Cochlear nucleus (pp. 479–495). London: JAI Press.
About the authors
Jihène Emena Serkhane has been a PhD student in the Speech Perception Group at ICP (Institut de la Communication Parlée) since 2001. She took her undergraduate degree in Biology of Organisms and Populations (2000) from UCBL (Université Claude Bernard Lyon I), and her Master's degree in Cognitive Sciences (2001) from INPG (Institut National Polytechnique de Grenoble). Her work deals with the modeling of speech development using speech robotics and the Bayesian formalism, with a special concern for elaborating realistic models that may help analyze actual data in order to conceive new hypotheses about how speech emerges from embodied mechanisms.
Jean-Luc Schwartz, a member of the French CNRS, led the Speech Perception Group at ICP from 1987 to 2002, and has been leading the laboratory since 2003. His main research areas involve auditory modelling, speech perception, bimodal integration, perceptuo-motor interactions,
speech robotics and the emergence of language. He has been involved in various national and European projects, and has authored or co-authored more than 35 publications in international journals, 20 book chapters, and 100 presentations at national and international workshops. He is the co-editor of a book on speech communication and of special issues of the journals Speech Communication and Primatologie, and a co-organiser of the last AVSP conference in 2003. He is now at GIPSA-Lab.
Pierre Bessière has been a senior researcher at CNRS since 1992. He received his PhD (1983) in computer science from INPG. He did a post-doctorate at the Stanford Research Institute within a project for NASA, and then worked in an industrial company as the leader of several artificial intelligence projects. Since 1992, he has led the LAPLACE research group (www-laplace.imag.fr) on probabilistic reasoning for perception, inference and action. He also leads the BIBA (Bayesian Inspired Brain and Artefact) European project. He is a founder and scientific adviser of the ProBAYES company, which sells Bayesian solutions for industry.
Aspects of descriptive, referential, and information structure in phrasal semantics A construction-based model Peter F. Dominey Institut des Sciences Cognitives, CNRS UMR 5015
Phrasal semantics is concerned with how the meaning of a sentence is composed both from the meaning of the constituent words, and from extra meaning contained within the structural organization of the sentence itself. In this context, grammatical constructions correspond to form-meaning mappings that essentially capture this “extra” meaning and allow its representation. The current research examines how a computational model of language processing based on a construction grammar approach can account for aspects of descriptive, referential and information content of phrasal semantics. Keywords: cognitive development, grammatical deixis, language acquisition, perceptual scene analysis, lexical categorization
1. Introduction
Part of the great expressive power of language is the ability to specify not only "who did what to whom," but also to nuance this thematic content with additional informational dimensions. Thus, the same event structure can be described in a number of different manners in order to emphasize different aspects of its contents. Consider, for example, the following state of affairs: accept(reviewer, paper), explain(paper, evolution), where meaning is encoded in a predicate(agent, patient) format. Depending on his or her discourse or pragmatic goals, the speaker can describe this situation with different sentences, including:
(1) The paper that explains evolution was accepted by the reviewer.
(2) The reviewer accepted the paper that explains evolution.
(3) The paper that was accepted by the reviewer explains evolution.
(4) Evolution was explained by the paper that was accepted by the reviewer.
From these examples, it can be observed that these different sentence structures direct or focus the attention of the listener on different aspects of the corresponding event semantics. In the context of "vocalize to localize", this corresponds to a form of joint attention or grammatical deixis mechanism (Lœvenbruck et al., 2005) that is provided by phrasal semantics. A theory of language acquisition and processing should include a functional characterization of how phrasal semantics can be implemented and used to provide this expressive capability. The goal of the current research is to propose and validate a neuro-computationally plausible framework for how this could be implemented.

The problem of language acquisition can be posed in the following manner: given a set of 〈sentence, meaning〉 pairs, the child should learn which sentences are associated with which meanings, and should be able to generalize this knowledge to new sentences (e.g., Crain & Lillo-Martin, 1999: 56; Feldman et al., 1990). The child comes to this task equipped with some innate learning capabilities that are often referred to as the "initial state" or the language acquisition device (LAD). The 〈sentence, meaning〉 pairs are referred to as the primary linguistic data (PLD), and the result of learning is the adult state. One school of thought, associated with Chomsky (e.g., 1965, 1995), holds that the PLD is highly indeterminate and underspecifies the mapping to be learned. Thus, they propose that the LAD embodies a genetically pre-specified syntactic system or universal grammar (UG), and that language acquisition consists of setting the UG parameters to correspond to those of the target language. This implies the "continuity hypothesis," which holds that via UG children have an adult-like syntactic system that they bring to the problem of language acquisition (Pinker, 1984; and see discussion in Tomasello, 2000). This school thus argues for a UG in which the essential structure of the grammar is innate, and proposes that what is learned are the values of parameters that identify the target language grammar within the UG framework. Partially because of this endowment of UG, this school advocates the characterization of grammars in terms of formal syntactic regularities that can be characterized largely independently of semantics and pragmatics.

A separate "functionalist" school, associated with authors including but not limited to Talmy (1988), Feldman and Lakoff et al. (1990), Langacker (1991), Goldberg (1995, 1998, 2003), Tomasello (1998, 1999, 2000, 2003) and others (see Newmeyer, 1999, and papers in Tomasello, 1998), holds that the LAD does not contain a parameterized "universal grammar" but is rather a mechanism that learns the mappings between grammatical forms and meanings (grammatical constructions), emphasizing the importance of communicative and social functions in language acquisition. This school places a much greater emphasis on the concrete relation between grammatical forms and meaning, and thus diminishes the independent significance of abstract generative syntactic rules.
In contrast with the continuity hypothesis, this framework is based in part on observations that the first grammatical constructions employed by infants appear more appropriately considered in terms of idiom-like linguistic gestalts that are initially fixed, and that through a progressive usage-based analysis become more open and productive (reviewed in Tomasello, 2000, 2003). In this context, the competence of the speaker is characterized as a structured inventory of grammatical constructions, rather than as an abstract generative grammar. This view corresponds to the common ground between the construction grammar perspectives of Goldberg (1995, 1998, 2003) and Croft (2001) and the usage-based approach to language acquisition of Tomasello (2003). Interestingly, Jackendoff (2002) considers that grammatical constructions can rightfully take their place as lexical items, thus blurring the distinction between lexical items and the rules of the grammar.

Dominey (2000/2002) presented a construction-based model of lexical and phrasal semantics that demonstrated capabilities for argument structure satisfaction, or thematic role assignment, with a relatively restricted set of active and passive grammatical constructions. The effort in that study was to examine the importance of the interaction between meaning and grammatical structure. In the current study, the exploration of the construction model is extended and demonstrated to account for multiple aspects of phrasal semantics. Jackendoff (2003) has recently proposed a three-tiered framework for describing phrasal semantics, that is, the manner in which sentence structure communicates meaning beyond the sum of the lexical elements. In this framework, phrasal semantics is organized into descriptive, referential and information/focus tiers. The descriptive tier includes thematic role assignment and the associated argument satisfaction (i.e., the specification of "who did what to whom"), and the associated combinatorial ability to build relative clauses. In this context we will demonstrate how grammatical constructions indexed by word order and grammatical marking fulfill these functional criteria. The referential tier encompasses the existential or referential content of a sentence. In this context we will demonstrate how the grammatical construction framework allows for the resolution of particular types of pronoun reference and reflexive verb argument assignments. Finally, the information topic/focus tier includes the representation of pragmatic focus, which involves, for example, moving the thematic object to the head of the sentence in order to place a discourse focus on this element. Considering the following sentences, John pushed the block and The block was pushed by John, one can see that these differ in their information content, due to the focus component. Again, by permitting this word-ordering flexibility, phrasal semantics provides a powerful mechanism for grammatical deixis. In the next section the functional organization of the model is spelled out, and in Section 3 the performance of the model is described.
Part of the claim of the current research is that certain non-trivial aspects of the sentence-meaning relations in language can be learned. Such claims can be tested in the context of robotic systems that employ sensory perception for extracting meaning from the environment, which is then paired with verbal messages to provide input to the learning system. Section 3 will provide an overview of our debut in this line of research.

2. Sentence to meaning mapping (SM2) model
The model architecture is presented in Figure 1. From a behavioral perspective, during learning, the model is presented with 〈sentence, meaning〉 pairs, and it should learn the word meanings and the set of grammatical constructions that define the sentence-to-meaning mappings in the input training set. During testing, the model should demonstrate that it can use this knowledge to understand new sentences that use the same lexicon and the same set of grammatical constructions, but that were not presented in the training set. In particular, the model should demonstrate systematicity, such that words that have only been experienced in particular syntactic roles (e.g., subject in an SVO sentence) will be correctly processed when they appear in new legal syntactic positions (e.g., the same word now as an object in an SVO sentence). The functional organization of the model is based on the following principles: (1) language acquisition can be characterized as learning the mappings from grammatical form to meaning (i.e., grammatical constructions), which should allow productive generalization with the learned constructions; (2) within the sentence, the construction is encoded or identified by the relative configuration of open and closed class elements, which can thus be used as an index by which the corresponding construction for that sentence type can be learned and retrieved. These concepts are presented in an overview in Figure 1. The following sections then describe the model in detail.

2.1 Input representations
The two inputs to the system are sentences and meanings.

2.1.1 Sentence input
Sentences are encoded as linear sequences of words that are identified on input as being open or closed class elements. This lexical categorization is among the early language-related perceptual distinctions learned by infants, based on perceptible cues in the auditory signal (Shi et al., 1999; Höhle & Weissenborn, 2003; see papers in Morgan & Demuth, 1996).
Figure 1. Sentence to Meaning Mapping (SM2) architecture for language learning. Lexical semantics: open class words in the OpenClassArray are translated into their corresponding referents in the PredictedReferentsArray via the WordToReferent mapping. Phrasal semantics: the PredictedReferentsArray elements are then mapped onto their respective roles in the SceneEventArray by the FormToMeaning mapping, specific to each construction type. FormToMeaning is retrieved from the ConstructionInventory via the ConstructionIndex, which encodes the closed class function words and their relative positions with respect to open class words, and which uniquely characterizes each grammatical construction type.
We have recently demonstrated that in French and English, the temporal profile of the fundamental frequency (F0) of the speech signal provides a reliable cue for categorizing open and closed class words (Blanc et al., 2003). Related simulation studies in which prosody is symbolically encoded have also demonstrated successful results (Morgan et al., 1996). This early discrimination capacity, applied to the child's target language, is subsequently expressed in adulthood. Indeed, in adults, extensive data from event-related potentials, brain imaging and psycholinguistic studies indicate that these lexical categories are processed by distinct and dissociated neurophysiological pathways (e.g., Kluender & Kutas, 1993; Friederici, 1985; Pulvermüller, 1995; Brown et al., 1999).
In the model, words are represented as 25-element vectors, with content words coded by a single bit in the range 1–16, and function words by a single bit in the range 17–25. The content (or open class) words are encoded in the Open Class Array (OCA), which contains 6 fields, each a 25-element vector with single-bit encoding. The function (or closed class) words are encoded in a vector called the ConstructionIndex, described below. Additionally, each sentence is initiated by a closed class start symbol and terminated by a closed class end symbol.

2.1.2 Meaning input
If language acquisition is the learning of a mapping between sentences and meanings, then the infant must have some pre-linguistic capacity for representing this meaning. Well before their first birthday, infants can extract meaning from visual scenes, and demonstrate the ability to understand physical properties of object interaction and goal-directed actions (e.g., Woodward, 1998; Carey & Xu, 2000; Bellagamba & Tomasello, 1999; Meltzoff, 1995; Mandler, 1996; Talmy, 1988; Kotovsky & Baillargeon, 1998). This implies the existence of conceptual representations of events that can be instantiated by non-linguistic (e.g., visual) perceptual input prior to the development of language. These conceptual representations will form the framework upon which the mapping between linguistic and conceptual structure can be built. This approach does not exclude the possibility that the conceptual representation capability will become more sophisticated in parallel with linguistic development (see Bowerman & Levinson, 2001 for a survey of the issue). It does require, however, that at least a primitive conceptualization capability, able to deal with events in a predicate-argument format, exists in a pre-linguistic state. Indeed, Fisher (1996) has identified the requirement that event representations should take a predicate-argument structure that is related to the grammatical structure of the verb onto which they will be mapped. Likewise, in elaborating the structural relations between linguistic and conceptual forms, Jackendoff considers that predicate/argument structure is a central feature of semantic structures (Jackendoff, 2002, p. 123). Similarly, this type of abstract predicate/argument event representation is central to the structure-to-meaning mapping for grammatical constructions as characterized by Goldberg (1995, 1998, 2003). Thus, the "meaning" onto which the sentence is to be mapped takes this predicate(argument) form, encoded in the Scene Event Array (SEA), which consists of two sub-arrays containing fields corresponding to action, agent, object, and recipient/source. Each field is a 25-element vector with a single-bit encoding. The SEA thus allows the representation of simple events (e.g., give, take, push, touch), as well as their combinations in hierarchical events, described below.
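For concreteness, these encodings could be sketched as follows (a toy illustration: the bit assignments for particular words are invented, and indices are 0-based; the model itself specifies only the vector sizes and the single-bit coding):

import numpy as np

N = 25                                  # word vectors: 25 elements

def one_hot(bit):                       # single-bit encoding
    v = np.zeros(N)
    v[bit] = 1.0
    return v

OCA = np.zeros((6, N))                  # OpenClassArray: 6 open class slots
SEA = np.zeros((2, 4, N))               # SceneEventArray: 2 sub-arrays x
                                        # (action, agent, object, recipient)
# Encode the open class words of "the block was pushed by the triangle"
# (invented indices: block=3, pushed=7, triangle=5; bits 0-15 for content
# words, bits 16-24 for function words):
for slot, bit in enumerate([3, 7, 5]):
    OCA[slot] = one_hot(bit)
# Meaning push(triangle, block): action, agent, object of the first event.
SEA[0, 0], SEA[0, 1], SEA[0, 2] = one_hot(7), one_hot(5), one_hot(3)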
2.2 Learning word meanings: Lexical semantics
In the initial learning phases, the association between a word (in the OpenClassArray, OCA) and its corresponding referent (in the SceneEventArray, SEA) is learned and stored in the associative memory of the WordToReferent matrix (Eqn 1). The parameter α specifies the influence of syntactic knowledge in "zooming in" on the appropriate word-to-referent mapping. In the initial configuration, prior to the accumulation of syntactic/grammatical construction knowledge, the term α is 1, and this learning simply associates every word with every element in the current scene. This exploits a form of cross-situational learning, in which the correct word-referent associations emerge as those which remain constant across multiple sentence-scene situations (Siskind, 1996). In this manner the system can extract the cross-situational regularity that a given word will have a higher coincidence with the referent to which it refers than with other objects. This allows initial word learning to occur, which contributes to learning the mapping between sentence and scene structure (Eqns 4, 5 & 6 below). Note that this first level has been addressed by Siskind (1996), Roy and Pentland (2000), and Steels (2001), and we treat it here in a relatively simple but effective manner, with more attention to the interaction between lexical and phrasal semantics.

Once this learning has occurred, knowledge of the grammatical structure, encoded in FormToMeaning, can be used to "zoom in on" or identify the appropriate referent (in the SEA) for a given word (in the OCA). FormToMeaning is an associative memory that specifies, for each construction type, the mapping from elements in the OCA to elements in the SEA. Exploiting this knowledge allows the system to avoid mapping open class elements to the wrong scene referent elements. Functionally, this corresponds to a zero value of α in Eqn 1. In this configuration, only the mapping between the word and its grammatically identified scene referent (i.e., the one specified in FormToMeaning) is strengthened. This corresponds to a form of "syntactic bootstrapping" in word learning. Thus, for the new word "gugle", knowledge of the appropriate grammatical construction for the sentence "John pushed the gugle" can be used to assign "gugle" to the object of push. LexLearningRate is a scalar-valued parameter that specifies the learning rate, i.e. the rate of change of the weights in the WordToReferent matrix.

WordToReferent(i,j) = WordToReferent(i,j) + OCA(k,i) * SEA(m,j)
    * LexLearningRate * Max(α, FormToMeaning(m,k))    (1)
(1)
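As a rough illustration, Eqn. 1 can be written as an outer-product update over the OCA and SEA slots; this is a sketch under the array conventions above, with the function name and default parameters as assumptions:

```python
import numpy as np

def update_word_to_referent(w2r, oca, sea, f2m, alpha=1.0, lex_rate=0.1):
    """Eqn 1: strengthen word-referent associations over a <sentence,
    scene> pair.  With alpha = 1 (pre-syntactic stage) every word is
    associated with every scene element, so correct pairings emerge
    cross-situationally; with alpha = 0 only the grammatically
    identified referent (via FormToMeaning) is strengthened."""
    for k in range(oca.shape[0]):        # word slots in the OCA
        for m in range(sea.shape[0]):    # referent slots in the SEA
            gate = max(alpha, f2m[m, k])
            # outer product gives OCA(k,i) * SEA(m,j) for all (i,j)
            w2r += lex_rate * gate * np.outer(oca[k], sea[m])
    return w2r
```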
2.3 Mapping sentence to meaning: Phrasal semantics

The objective of phrasal semantics in the current context is to determine the mapping from sentence to meaning, particularly with respect to thematic role assignment and the related issues of phrasal semantics as outlined above. The learning task for the model is, given a set of 〈sentence, meaning〉 input pairs, to acquire the corresponding inventory of grammatical constructions that accounts for those sentences, and that can generalize to all new sentences within that set of grammatical constructions. In terms of the architecture in Figure 1, with the example sentence “The block was pushed by the triangle,” the underlying processes are defined in the following successive steps. First, words in the Open Class Array (block, pushed, triangle) are decoded into their corresponding scene referents (via the WordToReferent mapping described above) to yield the Predicted Referents Array (Eqn. 2), which thus contains the translated words (block, pushed, triangle) while preserving their original order from the OCA.

PRA(k,j) = ∑_{i=1..n} OCA(k,i) * WordToReferent(i,j)    (2)
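Eqn. 2 is then a single matrix product over the numpy arrays sketched above:

```python
def predicted_referents(oca, w2r):
    """Eqn 2: PRA(k,j) = sum_i OCA(k,i) * WordToReferent(i,j) -- each
    content word is translated into its learned scene referent while
    its position in the sentence is preserved."""
    return oca @ w2r
```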
The grammatical construction for the input sentence corresponds to a specific mapping between referents (in the PRA) and the components of the meaning representation (in the SceneEventArray, SEA). This mapping is encoded in the FormToMeaning array. The problem will be to store and retrieve, for each grammatical form, the appropriate corresponding FormToMeaning mapping, i.e. the construction. To solve this problem, we must extract from each grammatical form a unique corresponding Construction Index, based on lexical category, word order and grammatical marking (Bates et al., 1982). Then, the appropriate FormToMeaning mapping for each grammatical form can be indexed by its corresponding Construction Index. We first consider how the ConstructionIndex is generated, and then how it is associated with the appropriate FormToMeaning mapping. The Construction Index (Eqn. 3) encodes the grammatical structure of a sentence in terms of the function words and their relative position with respect to content words in the sentence. It is thus a re-coding of the sentence in which both the position and identity of function words are preserved, while for content words only position is preserved (yielding something like “___ was ___ by ___” for the example in Figure 1). Since each grammatical form or construction has a unique configuration of function and content words, with respect to their identity, order and relative position, the Construction Index will thus uniquely identify each distinct grammatical form. The Construction Index is a 25-element vector. Each function word is encoded as a single bit in the 25-element FunctionWord vector. When a
function word is encountered during sentence processing, the current contents of the ConstructionIndex are shifted by n + m bits in a ring buffer (indicated by fshift), where n corresponds to the bit that is on in the current FunctionWord vector, and m corresponds to the number of open class words that have been encountered since the previous function word (or the beginning of the sentence). Finally, a vector addition is performed on this result and the FunctionWord vector (i.e., bit n is set). In other words, for bit k in the current ConstructionIndex, the bit corresponding to (k + n + m) modulo 25 will be set in the new ConstructionIndex, and finally bit n will be set. While this may seem obscure, the desired result is that the ConstructionIndex should uniquely code or represent distinct grammatical constructions as a function of their relative configurations of open and closed class words. Equation 3 was designed to carry out this discrimination function, and will be demonstrated to behave as desired, though clearly there may exist potentially superior alternatives to be explored in the future.

ConstructionIndex = fshift(ConstructionIndex, (n + m)) + FunctionWord    (3)
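One way to realise the fshift of Eqn. 3 is as a circular shift of the 25-element index; a sketch, with the bit assignments for the example sentence being assumptions carried over from the hypothetical lexicon above:

```python
import numpy as np

def update_construction_index(index, fn_word_bit, open_words_since):
    """Eqn 3: circularly shift the ConstructionIndex by n + m
    (n = bit of the current function word, m = open class words seen
    since the previous function word), then set bit n."""
    shifted = np.roll(index, fn_word_bit + open_words_since)  # ring buffer
    shifted[fn_word_bit] = 1.0      # vector addition of FunctionWord
    return shifted

# "The block was pushed by the triangle": function words the, was, by, the,
# paired with the count of open class words seen since the previous one.
index = np.zeros(25)
for bit, m in [(16, 0), (17, 1), (18, 1), (16, 0)]:
    index = update_construction_index(index, bit, m)
```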
The link between the ConstructionIndex and the corresponding FormToMeaning mapping is established as follows. During training, as each input sentence is processed, we reconstruct the specific FormToMeaning mapping for that sentence (Eqn. 4). The resulting FormToMeaningCurrent encodes the correspondence between the word order that is preserved in the Predicted Referents Array (PRA, Eqn. 2; block, pushed, triangle in the example in Figure 1) and the thematic roles in the SEA (pushed, triangle, block in the example in Figure 1). Note that the quality of FormToMeaningCurrent will depend on the quality of acquired word meanings in WordToReferent used to populate the PRA. Thus, syntactic learning requires a minimum baseline of semantic knowledge, corresponding to the “asyntactic first pass” discussed by Gillette et al. (1999). Given the FormToMeaningCurrent mapping for the current sentence, we can now associate it with the corresponding ConstructionIndex for that sentence (Eqn. 5), storing this association in the ConstructionInventory associative memory.

FormToMeaningCurrent(m,k) = ∑_{i=1..n} PRA(k,i) * SEA(m,i)    (4)

ConstructionInventory(i,j) = (ConstructionInventory(i,j) + ConstructionIndex(i) * FormToMeaningCurrent(j) * PhrasalLearningRate) / Sum(ConstructionInventory)    (5)
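In the same style, Eqns. 4 and 5 can be sketched as follows; the flattening of FormToMeaningCurrent into a vector for storage is an assumption about the implementation, as is the learning-rate default:

```python
import numpy as np

def form_to_meaning_current(pra, sea):
    """Eqn 4: FormToMeaningCurrent(m,k) = sum_i PRA(k,i) * SEA(m,i),
    i.e. the word-order-to-thematic-role correspondence for the
    current sentence."""
    return sea @ pra.T

def store_construction(inventory, construction_index, f2m_current,
                       phrasal_rate=0.1):
    """Eqn 5: associate the ConstructionIndex with the current
    form-to-meaning mapping (flattened to a vector for storage) in
    the ConstructionInventory, then normalise by the total."""
    inventory += phrasal_rate * np.outer(construction_index,
                                         f2m_current.ravel())
    return inventory / inventory.sum()
```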
Finally, once a construction has been learned via this mechanism, for new sentences we can extract the FormToMeaning mapping from the learned ConstructionInventory by using the ConstructionIndex literally as an index into this associative memory, as illustrated in Eqn. 6.

FormToMeaning(i) = ∑_{j=1..n} ConstructionInventory(i,j) * ConstructionIndex(j)    (6)
It should also be noted that the associative memory in the ConstructionInventory is subject to the standard hazards of simple associative memories of this type. In particular, for ConstructionIndex vectors that are similar, there may be retrieval errors. For this reason, we have also implemented the ConstructionInventory as a more robust and functionally equivalent lookup table, where the ConstructionIndex acts as an index and the appropriate FormToMeaning map is stored/retrieved. An advantage of this method is that constructions are discretely coded and can be analysed post-hoc, e.g. the number of constructions required to account for an input corpus can be quantified. In addition to simple 〈sentence, meaning〉 pairs such as 〈”The block pushed the ball”, push(block, ball)〉, we will also consider hierarchically structured pairs such as 〈“The block that pushed the ball touched the triangle”, push(block, ball), touch(block, triangle)〉 that employ a relativised sentence and a dual-event scene. To accommodate the dual scenes for such complex events, Eqns. 4–7 are instantiated twice each, to represent the two components of the dual scene. In the case of simple scenes, the second component of the dual scene representation is null. Likewise, there are two instances of the following data structures: ConstructionInventory, FormToMeaning, FormToMeaningCurrent, and SceneEventArray, to account for the dual event scenes and the corresponding mapping mechanisms. We evaluate performance by using the WordToReferent and FormToMeaning knowledge to construct, for a given input sentence, the “predicted scene”. That is, the model will construct an internal representation of the scene that should correspond to the input sentence. This is achieved by first converting the OpenClassArray into its corresponding scene items in the PredictedReferentsArray as specified in Eqn. 2. The referents are then re-ordered into the proper scene representation via application of the FormToMeaning transformation as described in Eqn. 7, after FormToMeaning is retrieved from the ConstructionInventory with the ConstructionIndex as described in Eqn. 6.

PSA(m,i) = ∑_{k=1..n} PRA(k,i) * FormToMeaning(m,k)    (7)
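Retrieval and evaluation (Eqns. 6 and 7) can then be sketched as follows; the reshape mirrors the flattening assumed above, the matrix orientation follows that storage convention, and the binarising threshold is an assumption:

```python
import numpy as np

def retrieve_form_to_meaning(inventory, construction_index, shape):
    """Eqn 6: use the ConstructionIndex as a (soft) index into the
    ConstructionInventory to recover the stored mapping, reshaped
    back into its matrix form."""
    return (inventory.T @ construction_index).reshape(shape)

def predicted_scene(pra, f2m):
    """Eqn 7: PSA(m,i) = sum_k FormToMeaning(m,k) * PRA(k,i) --
    re-order the translated referents into scene structure."""
    return f2m @ pra

def scene_errors(psa, sea, threshold=0.5):
    """Quantify performance as mismatches between the predicted scene
    array and the scene event array derived from input."""
    return int(((psa > threshold) != (sea > threshold)).sum())
```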
When learning has proceeded correctly, the predicted scene array (PSA) contents should match those of the scene event array (SEA) that is directly derived from
input to the model. We then quantify performance error in terms of the number of scene interpretation errors, or mismatches between PSA and SEA.
3. Model performance

We will now examine how this model of grammatical construction learning can address the issues of phrasal semantics as outlined in the introduction. Part of the limitation on the complexity of the grammar studied in Dominey (2000) was due to the simple structure of the meanings or scene representations that were employed, which consisted of a single event with three arguments. This allows sentence types including active, dative, passive and dative passive. However, a sentence with a relativised subject, such as “The block that was pushed by the moon touched the triangle”, corresponds in fact to two distinct events, which could be represented as push(moon, block) and touch(block, triangle). Here, we will first observe the benefits that a more complex scene representation can provide for the development of more complex grammatical structures, particularly relativised phrases.

3.1 Aspects of the descriptive and information tiers

As indicated above, the model should address aspects of the descriptive tier, which includes thematic role assignment and associated argument satisfaction, as well as the combinatorial ability to build relative clauses. Likewise, it should address aspects of the information topic/focus tier, which includes the capability for expression of pragmatic focus that involves, for example, moving the thematic object to the head of the sentence in order to place a discourse focus on this element. Figure 1 and the description in Section 2.3 have described how the model performs simple argument structure satisfaction in the context of grammatical constructions. However, one of the hallmarks of human language is the ability to cope with hierarchical complexity in sentences. From a syntactic perspective, this complexity has been extensively analyzed, and the rules that govern the relations between components at different hierarchical levels are the object of an extensive body of research (e.g., Chomsky, 1995). One can thus consider abstract hierarchical structure in the absence of meaning, as in autonomous syntax. In this absence of meaning, however, the purpose of hierarchical structure remains rather abstract. From the perspective of meaning, or semantics, however, the purpose of this hierarchical structure becomes strikingly functional. This is reflected in the language processing architecture suggested by Jackendoff (1999, 2002), in which the conceptual/semantic component has rich combinatorial hierarchical structure independent of (but mapped onto) syntactic structure.
Consider, for example, the sequence of events depicted in the lower left corner of Figure 2. In these two scenes, the common element is “block.” Depending on the discourse focus (i.e. the head of the sentence) this complex scene can be described in different ways. We can consider the sentence where block is in the focus with an active verb, yielding: “The block that was touched by the triangle pushed the circle.” Note that, for the sake of consistency, we adopt the convention that the referent of the relative clause is always the first of the two scene events. The same scene can also be described by a sentence that places the item “circle” in the discourse focus: “The circle was pushed by the block that was touched by the triangle.” Interestingly, here we see that if focus is taken into account, then the apparent redundancy between the active and passive forms is eliminated. An example of the processing of this relativised sentence “The block that was touched by the triangle pushed the circle” is provided in Figure 2. As for simple (single event) sentences, the OpenClassArray elements are translated to their referents via the lexical semantics information in WordToReferent, thus populating the PredictedReferentsArray. The meaning component of the 〈sentence, meaning〉 pair for this sentence corresponds to two distinct events: touch(triangle, block), push(block, circle). These two event representations are linked by the common element block, thus forming a hierarchically linked semantic representation that corresponds to the hierarchical structure of the relative sentence. The task now is to map these PredictedReferentsArray elements onto the dual event structure of the SceneEventsArrays.

Figure 2. Example of Relativized Sentence Processing. The hierarchical structure of the sentence is reflected in the structure of the semantic representation. See text.

As mentioned above in the model description, to account for the dual event scenes and the corresponding mapping mechanisms there are two instances of the following data structures: ConstructionInventory, FormToMeaning, FormToMeaningCurrent, and SceneEventArray. Thus, the ConstructionIndex will retrieve FormToMeaning mappings corresponding to the two scene events. Each will respectively map the contents of the PredictedReferentsArray onto the appropriate elements in the two SceneEventsArrays. Note once again that in the current example, the referent block maps onto different thematic roles in the two events — reflecting the linked hierarchical structure of the relative phrase. The current study thus exposes the model to grammatical construction Types 1–26 of the Appendix. In this experiment the ConstructionInventory functions as a lookup table, functionally equivalent to the associative memory but more efficient, where the ConstructionIndex acts as an index and the appropriate FormToMeaning map is stored/retrieved. The model learns the 26 sentence types without errors, demonstrating that the ConstructionIndex is robust to the structural variability in these sentence types. In other words, each of the different grammatical constructions generates a distinct ConstructionIndex as defined by Eqn. 3, and thus the appropriate FormToMeaning mapping can be stored and retrieved using the ConstructionIndex as an index into the ConstructionInventory. In this manner, the model has been demonstrated to generalize without error to new sentences that (1) use words that have been learned and stored in the WordToReferent associative memory, and that (2) use grammatical constructions that have been previously learned and stored as FormToMeaning mappings in the ConstructionInventory. Returning to the examples of relativised sentences (1–4) presented in the introduction, we can now see in the Appendix that they correspond to construction types 8–11. In these elements of the Appendix, we see the grammatical constructions defined in terms of the sentence structure “frames” and the corresponding semantic structure or meaning “frames”. This demonstrates that the concept of grammatical constructions as form-to-meaning mappings quite adequately captures the phrasal semantic requirements for the expression of relativised noun phrases, as well as the liberation from fixed word order (e.g. in passive forms) that allows a form of syntactic deixis.

3.2 Aspects of the referential tier and beyond

From the perspective of the referential tier, the model should also demonstrate how the grammatical construction framework allows for resolution of pronoun reference and reflexive verb argument assignments. As illustrated above, “dual”
events in the meaning representation allow the use of hierarchical meanings that correspond to relative clauses. The current experiment demonstrates how dual events also support additional sentence types, including conjoined (John took the key and opened the door), reflexive (The boy said that the dog was chased by the cat), and reflexive pronoun (The block said that it pushed the cylinder) sentence types (Types 27–38). The consideration of these sentence types compels us to address the question of how their meanings are represented. Conjoined sentences are represented by the two corresponding events, e.g., took(John, key), open(John, door) for the example above. Reflexives are represented, for example, as said(boy), chased(cat, dog). This assumes, for reflexive verbs (e.g., said, saw), that the meaning representation includes the second event as an argument to the first. Finally, for the reflexive pronoun types, in the meaning representation the pronoun’s referent is explicit, as in said(block), push(block, cylinder) for “The block said that it pushed the cylinder.” Note that the pronoun is treated as a closed class element and thus encoded in the ConstructionIndex, and not in the OpenClassArray. The net result is that the behavior of the system is correct — the sentence is reliably mapped onto its meaning. At a finer level, however, the referent for “it” is not explicitly represented. To begin to address this issue, one could channel incoming pronouns both to the ConstructionIndex and directly to the PredictedReferentsArray, without decoding of their lexical semantics. This would then potentially allow on-line binding between pronouns and their corresponding scene elements via the FormToMeaning mapping. Based on 〈sentence, meaning〉 pairs as thus specified, the model learns the appropriate FormToMeaning mappings for the ensemble of sentence types in the Appendix. This demonstrates that the 38 sentence types are structurally distinct, as the system extracts unique ConstructionIndices for each of them. This allows, for each type, the binding of the appropriate FormToMeaning mapping to the corresponding ConstructionIndex. In the first exposure to a new construction type, the model matches the predicted referents with the scene event elements to determine the current form-to-meaning mapping. This mapping is stored in the ConstructionInventory, indexed by the ConstructionIndex unique to that sentence type. The observation here that the model can learn the 38 constructions from the Appendix confirms that, at least for these constructions, there is a unique pattern of closed class items for each, and that this pattern can be used to store and retrieve the form-to-meaning mappings.
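For concreteness, these dual-event meanings can be written out directly in the predicate(argument) notation used above; the Python tuple encoding is illustrative only:

```python
# Dual-event meanings in predicate(argument) form, as described above.
meanings = {
    "John took the key and opened the door":
        [("took", "John", "key"), ("open", "John", "door")],
    "The boy said that the dog was chased by the cat":
        [("said", "boy"), ("chased", "cat", "dog")],
    "The block said that it pushed the cylinder":
        [("said", "block"), ("push", "block", "cylinder")],  # "it" resolved
}
```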
3.3 Robot language acquisition in the construction framework

Given the demonstrated ability of the model to learn sentence-to-meaning mappings, we have recently started to explore the use of this model in a robot language learning context (Dominey, 2003; Dominey & Boucher, 2005). In these experiments, a human experimenter manipulates toy blocks in the field of view of a computer vision system, and simultaneously narrates her actions, something like “The block was pushed by the triangle that touched the moon”. This involves the automatic extraction of sentences from speech using standard human language technology tools, and the extraction of meaning from visual scenes. For meaning extraction, we use an approach similar to that of Siskind (2001) but much simpler, with off-the-shelf (SmartVision Panlab) color-based computer vision for object recognition and tracking, in order to extract contact events between objects in dynamic scenes. Then, events such as touch, push, take and give are parsed from the stream of contacts based on the definition of each of these event types in terms of a specific pattern of contact or contact sequence. Causality is attributed as a function of the relative velocity of the objects involved in a contact, i.e. the object that was moving faster towards the other object “did it”. The events are coded in predicate(argument) form as described above. Thus, from live human-generated and narrated events, our robotic system can extract 〈sentence, meaning〉 pairs, and use the construction-based model to learn the underlying grammatical constructions (Dominey, 2003). This provides a concrete demonstration, as proposed by Goldberg (1995), of the tight correspondence between the structure of perceptual events that are basic to human experience and the constructions for the corresponding basic sentence types.
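A highly simplified sketch of this contact-to-event parsing step might look as follows; the contact record format and the helper name are assumptions for illustration, not the system's actual interface:

```python
def parse_event(contact):
    """Label a contact between two tracked objects, attributing agency
    to the object moving faster at the moment of contact, as described
    above.  `contact` is a hypothetical record: (obj_a, vel_a, obj_b,
    vel_b, displaced), where `displaced` flags whether the contacted
    object was moved."""
    obj_a, vel_a, obj_b, vel_b, displaced = contact
    agent, patient = (obj_a, obj_b) if vel_a > vel_b else (obj_b, obj_a)
    action = "push" if displaced else "touch"
    return (action, agent, patient)   # predicate(argument) form

print(parse_event(("triangle", 2.0, "block", 0.0, True)))
# ('push', 'triangle', 'block')
```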
4. Discussion

The sentence-to-meaning mapping model presented here embodies central aspects of construction grammar (Goldberg, 1995, 1998; Croft, 2001) and its application in a usage-based characterization of language acquisition (Tomasello, 2003), along with central tenets of the cue competition and coalition framework (Bates & MacWhinney, 1982). This can be expressed as the following principles: (P1) Language acquisition can be characterized as learning constructions, or mappings from grammatical form to meaning, that allow productive generalization with the learned constructions; (P2) Within the sentence, the construction is encoded or identified by the relative order or configuration of open and closed class elements, which can thus be used as an index by which the corresponding construction for that sentence type can be learned and retrieved. Evaluation of the model with respect
to its ability to fulfill the requirements of a lexical semantics system thus serves as a form of evaluation of these theories (to the extent that they are embodied in the model). The model has developed in a line of research that attempts to explain aspects of language processing in the context of cognitive sequence processing (Dominey et al., 2003; Dominey & Ramus, 2000; Lelekov et al., 2000; Hoen et al., 2003). While the current exposition has been limited to English, we have recently demonstrated that the two principles of the model are applicable in a cross-linguistic validation to Japanese (Dominey & Inui, 2004). The stated objective of this research was to examine the abilities of a construction-based model of language processing to accommodate a well-defined subset of the functional requirements of phrasal semantics. Clearly the whole job of phrasal semantics is an immense research project, hence the liberal use of the word “aspects” to signify the limited nature of the current analysis. Given this proviso, let us consider to what extent the objectives have been realized. With respect to the descriptive tier, we have seen that the construction model handles a variety of abstract constructions, employing a novel and effective method for argument structure satisfaction, or thematic role assignment. In this context, the system also demonstrates a novel and effective method for accommodating relative clauses in NPs, assigning an important role to the hierarchical structure of meaning in driving that of grammatical structure. This enters into the current discussion concerning the hierarchical and recursive structure of different dimensions of linguistic structure. In this context, Hauser et al. (2002) have proposed that syntax alone possesses a capability for recursive structure, while Jackendoff (1999, 2003) argues that phonology, semantics and syntax are all independently recursive. The current study suggests that hierarchical structure in semantic or conceptual representations provides the structural framework onto which sentence structure is mapped. This could be extended to propose that the recursive and compositional structure of syntax is derived from that of the conceptual structure that it expresses. With respect to the information/focus tier, the model clearly demonstrates that the adopted grammatical construction approach is quite adequate for allowing the use of multiple non-canonical word orderings in order to rather precisely manipulate the focus or informational content of different grammatical constructions. Similarly, in the context of the referential tier, the system demonstrates the capability for reflexive and reflexive pronoun constructions. From a developmental perspective, the construction paradigm provides easy access for the infant into the world of utterance-level language. In the earliest stages of utterance understanding, the child appears to treat sentences as idiom-like holophrases, gradually liberating these fixed constructions with increasing abstraction or schematicity. The resulting abstract construction capability provides
the advantage of an ability to easily acquire a variety of construction types that allow systematic generalization to new sentences within the domain of the learned constructions. This is achieved with a striking minimum of “syntactic” machinery, which is replaced by structural mapping capabilities. As it stands, the system suffers from one significant limitation: all new constructions must be acquired by learning. That is, the system must be exposed to one 〈sentence, meaning〉 pair that is representative of the new construction in order to acquire the mapping. This is likely the case for children up to around two years of age (Tomasello, 2000, 2003; Clark, 2003). But clearly, the human language capacity includes the ability to produce and to comprehend sentences derived from novel grammatical constructions with no previous exposure to those constructions. Addressing this problem will require “fractionating” the current level of treatment of grammatical constructions into smaller units that would include noun phrases (Miikkulainen, 1996). This would allow pattern-finding mechanisms to operate at the level of these subphrasal construction components, which could then be recombined to provide an on-line construction generation capability. In the meantime, the current research advances the state of affairs by demonstrating that a model of language processing based on tenets from construction grammar (Goldberg, 1995, 1998) and the coding of phrasal semantics (Bates & MacWhinney, 1982) can begin to account for interesting aspects of phrasal semantics (Jackendoff, 2003).
Acknowledgements

This work was supported by the ACI for Integrative and Computational Neuroscience, the European Eurocores Origin of Man, Language and Languages Project, and the HFSP.
References

Bates, E., McNew, S., MacWhinney, B., Devescovi, A., & Smith, S. (1982). Functional constraints on sentence processing: A cross-linguistic study. Cognition, 11, 245–299.
Bellagamba, F., & Tomasello, M. (1999). Re-enacting intended acts: Comparing 12- and 18-month-olds. Infant Behavior and Development, 22(2), 277–282.
Blanc, J.M., Dodane, C., & Dominey, P.F. (2003). Temporal processing for syntax acquisition: A simulation study. 25th Annual Meeting of the Cognitive Science Society, Boston (Massachusetts), USA.
Bowerman, M. (1996). Learning how to structure space for language: A crosslinguistic perspective. In P. Bloom, M. Peterson, L. Nadel, & M. Garrett (Eds.), Language and space (pp. 385–486). Cambridge, MA: MIT Press.
Bowerman, M., & Levinson, S.C. (2001). Language acquisition and conceptual development. Cambridge: Cambridge University Press.
Brown, C.M., Hagoort, P., & ter Keurs, M. (1999). Electrophysiological signatures of visual lexical processing: Open- and closed-class words. Journal of Cognitive Neuroscience, 11(3), 261–281.
Carey, S., & Xu, F. (2001). Infants' knowledge of objects: Beyond object files and object tracking. Cognition, 80, 179–213.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Chomsky, N. (1995). The minimalist program. Cambridge, MA: MIT Press.
Christiansen, M., & Chater, N. (1999). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2), 157–205.
Clark, E.V. (2003). First language acquisition. Cambridge: Cambridge University Press.
Crain, S., & Lillo-Martin, D. (1999). An introduction to linguistic theory and language acquisition. Malden, MA: Blackwell.
Croft, W. (2001). Radical construction grammar: Syntactic theory in typological perspective. Oxford: Oxford University Press.
Dominey, P.F. (2000/2002). Conceptual grounding in simulation studies of language acquisition. Evolution of Communication, 4(1), 57–85.
Dominey, P.F. (2003). Learning grammatical constructions in a miniature language from narrated video events. Proceedings of the 25th Annual Meeting of the Cognitive Science Society, Boston.
Dominey, P.F., & Boucher, J.-D. (2005). Developmental stages of perception and language acquisition in a perceptually grounded robot. Cognitive Systems Research, 6, 243–259.
Dominey, P.F., Hoen, M., Lelekov, T., & Blanc, J.M. (2003). Neurological basis of language in sequential cognition: Evidence from simulation, aphasia and ERP studies. Brain and Language, 86, 207–225.
Dominey, P.F., & Inui, T. (2004). A developmental model of syntax acquisition in the Construction Grammar framework with cross-linguistic validation in English and Japanese. Proceedings of the 20th International Conference on Computational Linguistics: Workshop on Psycho-Computational Models of Human Language Acquisition, Geneva.
Dominey, P.F., & Lelekov, T. (2000). Nonlinguistic transformation processing in agrammatic aphasia. Comment on Grodzinsky 2000. Behavioral and Brain Sciences, 23(1), 30.
Dominey, P.F., & Ramus, F. (2000). Neural network processing of natural language: I. Sensitivity to serial, temporal and abstract structure of language in the infant. Language and Cognitive Processes, 15(1), 87–127.
Feldman, J.A., Lakoff, G., Stolcke, A., & Weber, S.H. (1990). Miniature language acquisition: A touchstone for cognitive science. Proceedings of the 12th Annual Conference of the Cognitive Science Society, 686–693.
Feldman, J., Lakoff, G., Bailey, D., Narayanan, S., Regier, T., & Stolcke, A. (1996). L0: The first five years. Artificial Intelligence Review, 10, 103–129.
Fern, A., Givan, R., & Siskind, J.M. (2002). Specific-to-general learning for temporal events with applications to learning event definitions from video. Journal of Artificial Intelligence Research, 379–449.
Fisher, C. (1996). Structural limits on verb mapping: The role of analogy in children's interpretation of sentences. Cognitive Psychology, 31, 41–81.
Friederici, A.D. (1985). Levels of processing and vocabulary types: Evidence from on-line comprehension in normals and agrammatics. Cognition, 19, 133–166.
Gentner, D., & Medina, J. (1998). Similarity and the development of rules. Cognition, 65, 263–297.
Gillette, J., Gleitman, H., Gleitman, L., & Lederer, A. (1999). Human simulation of vocabulary learning. Cognition, 73, 135–151.
Gleitman, L. (1990). The structural sources of verb meanings. Language Acquisition, 1, 3–55.
Goldberg, A. (1995). Constructions. Chicago & London: University of Chicago Press.
Goldberg, A. (1998). Patterns of experience in patterns of language. In M. Tomasello (Ed.), The new psychology of language, 1, 203–219.
Goldberg, A. (2003). Constructions: A new theoretical approach to language. Trends in Cognitive Sciences, 7(5), 219–224.
Greenfield, P.M. (1991). Language, tools and brain: The ontogeny and phylogeny of hierarchically organized sequential behavior. Behavioral and Brain Sciences, 14, 531–595.
Hauser, M., Chomsky, N., & Fitch, T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598), 1569–1579.
Hoen, M., Golembiowski, M., Guyot, E., Deprez, V., Caplan, D., & Dominey, P.F. (2003). Training with cognitive sequences improves syntactic comprehension in agrammatic aphasics. Neuroreport, 14(3), 495–499.
Höhle, B., & Weissenborn, J. (2003). German-learning infants' ability to detect unstressed closed-class elements in continuous speech. Developmental Science, 6(2), 122–127.
Jackendoff, R. (1999). Parallel constraint-based generative theories of language. Trends in Cognitive Sciences, 3(10), 393–400.
Jackendoff, R. (2002). Foundations of language: Brain, meaning, grammar, evolution. Oxford: Oxford University Press.
Kluender, R., & Kutas, M. (1993). Bridging the gap: Evidence from ERPs on the processing of unbounded dependencies. Journal of Cognitive Neuroscience, 5, 196–214.
Kotovsky, L., & Baillargeon, R. (1998). The development of calibration-based reasoning about collision events in young infants. Cognition, 67, 311–351.
Langacker, R. (1991). Foundations of cognitive grammar. Practical applications, Volume 2. Stanford: Stanford University Press.
Lelekov, T., Franck, N., Dominey, P.F., & Georgieff, N. (2000). Cognitive sequence processing and syntactic comprehension in schizophrenia. Neuroreport, 11(10), 2145–2149.
Leslie, A.M., & Keeble, S. (1987). Do six-month-olds perceive causality? Cognition, 25, 265–288.
Lœvenbruck, H., Baciu, M., Segebarth, C., & Abry, C. (2005). The left inferior frontal gyrus under focus: An fMRI study of the production of deixis via syntactic extraction and prosodic focus. Journal of Neurolinguistics, 61, 237–258.
Mandler, J. (1996). Preverbal representation and language. In P. Bloom et al. (Eds.), Language and space (pp. 365–384).
McDonough, L., Choi, S., & Mandler, J.M. (2003). Understanding spatial relations: Flexible infants, lexical adults. Cognitive Psychology, 46, 229–259.
Meltzoff, A. (1995). Understanding the intentions of others: Re-enactment of intended acts by 18-month-old children. Developmental Psychology, 25, 952–962.
Miikkulainen, R. (1996). Subsymbolic case-role analysis of sentences with embedded clauses. Cognitive Science, 20, 47–73.
Morgan, J.L., & Demuth, K. (1996). Signal to syntax: Bootstrapping from speech to grammar in early acquisition. Mahwah, NJ: Lawrence Erlbaum.
Morgan, J.L., Shi, R., & Allopenna, P. (1996). Perceptual bases of rudimentary grammatical categories: Toward a broader conceptualization of bootstrapping. In J.L. Morgan & K. Demuth (Eds.),
Signal to syntax: Bootstrapping from speech to grammar in early acquisition (pp. 263–286). Mahwah, NJ: Lawrence Erlbaum.
Newmeyer, F. (1998). Language form and language function. Cambridge, MA: MIT Press.
Pinker, S. (1984). Language learnability and language development. Cambridge, MA: Harvard University Press.
Pinker, S. (1987). The bootstrapping problem in language acquisition. In B. MacWhinney (Ed.), Mechanisms of language acquisition. Mahwah, NJ: Lawrence Erlbaum Associates.
Pulvermüller, F. (1995). Agrammatism: Behavioral description and neurobiological explanation. Journal of Cognitive Neuroscience, 7(2), 165–181.
Roy, D., & Pentland, A. (2002). Learning words from sights and sounds: A computational model. Cognitive Science, 26(1), 113–146.
Shi, R., Werker, J.F., & Morgan, J.L. (1999). Newborn infants' sensitivity to perceptual cues to lexical and grammatical words. Cognition, 72(2), B11–B21.
Siskind, J.M. (2001). Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. Journal of Artificial Intelligence Research, 15, 31–90.
Stolcke, A., & Omohundro, S.M. (1994). Inducing probabilistic grammars by Bayesian model merging. In Grammatical inference and applications: Proceedings of the 2nd International Colloquium on Grammatical Inference. New York: Springer.
Talmy, L. (1988). Force dynamics in language and cognition. Cognitive Science, 10(2), 117–149.
Tomasello, M. (1998). The new psychology of language: Cognitive and functional approaches. Mahwah, NJ: Erlbaum.
Tomasello, M. (1999). The item-based nature of children's early syntactic development. Trends in Cognitive Sciences, 4(4), 156–163.
Tomasello, M. (2000). Do young children have adult syntactic competence? Cognition, 74, 209–253.
Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Cambridge, MA: Harvard University Press.
Woodward, A.L. (1998). Infants selectively encode the goal object of an actor's reach. Cognition, 69, 1–34.
About the author

Peter Ford Dominey earned his BA at Cornell University in Cognitive Psychology and Artificial Intelligence in 1984. In 1989 and 1993 respectively he obtained the MSc and PhD in Computational Neuroscience from the University of Southern California, Los Angeles. From 1984 to 1986 he was a Software Engineer at the Data General Corporation, and from 1986 to 1993 he was a Systems Engineer at the Jet Propulsion Laboratory in Pasadena, CA. From 1993 to 1997 he was a post-doctoral fellow at INSERM U94 in Lyon, France, and in 1997 he became a tenured researcher in the CNRS. His research addresses understanding and simulating the neurophysiology of cognitive sequence processing and language, and their application to robot cognition.
Appendix: Sentence type data base

Sentences in the experiments used the nouns “block, moon, cylinder, dog, cat, ball” and the verbs “touch, push, take, give, said”. Closed class words included “that, to, from, by, was, it, itself, and”. The different grammatical structures represent a subset selection of the possible constructions for simple and dual sentences in English. The model does not currently account in a generalized way for verb tense morphology. The current sentences were all in the past tense.

Simple Event Sentences
1. Agent action object. (Active) E.g. John pushed the block.
Action(agent, object), e.g. push(John, block).
2. Object was actioned by agent. (Passive) E.g. The block was pushed by John.
Action(agent, object), e.g., push(John, block).
3. Agent actioned object to recipient. (Dative) E.g., John gave the block to Bill.
Action(agent, object, recipient), e.g., gave(John, block, Bill).
4. Object was actioned to recipient by agent. (Dative passive) E.g. The block was pushed to Bill by John.
Action(agent, object, recipient), e.g., pushed(John, block, Bill).
5. Agent action recipient object. E.g., John gave Bill the block.
Action(agent, object, recipient), e.g., gave(John, block, Bill).

Double event relatives
6. Agent1 that action1ed object2 action2ed object3. (Relative agent)
Action1(agent1,object2), Action2(agent1,object3)
7. Object3 was action2ed by agent1 that action1ed object2
Action1(agent1,object2), Action2(agent1,object3)
8. Agent1 that action1ed object2 was action2ed by agent3
Action1(agent1,object2), Action2(agent3,object1)
9. Agent3 action2ed object1 that action1ed object2
Action1(agent1,object2), Action2(agent3,object1)
10. Object2 that was action1ed by agent1 action2ed object3
Action1(agent1,object2), Action2(agent2,object3)
11. Object3 was action2ed by agent2 that was action1ed by agent1
Action1(agent1,object2), Action2(agent2,object3)
12. Object2 that was action1ed by agent1 was action2ed by agent3
Action1(agent1,object2), Action2(agent3,object2)
13. Agent3 action2ed object2 that was action1ed by agent1
Action1(agent1,object2), Action2(agent3,object2)
14. Object3 was action2ed to recipient4 by agent1 that action1ed object2
Action1(agent1,object2), Action2(agent1,object3,recipient4)
15. Agent1 that action1ed object2 was action2ed to recipient4 by agent3
Action1(agent1,object2), Action2(agent3,object1,recipient4)
16. Agent3 action2ed object4 to recipient1 that action1ed object2
Action1(agent1,object2), Action2(agent3,object4,recipient1)
17. Object4 was action2ed from agent3 to recipient1 that action1ed object2
Action1(agent1,object2), Action2(agent3,object4,recipient1)
18. Object2 that was action1ed by agent1 action2ed object3 to recipient4
Action1(agent1,object2), Action2(agent2,object3,recipient4)
19. Agent3 action2ed object4 to recipient2 that was action1ed by agent1
Action1(agent1,object2), Action2(agent3,object4,recipient2)
20. Agent1 that action1ed object2 to recipient3 action2ed object4
Action1(agent1,object2,recipient3), Action2(agent1,object4)
21. Object4 was action2ed by agent1 that action1ed object2 to recipient3
Action1(agent1,object2,recipient3), Action2(agent1,object4)
22. Agent4 action2ed object1 that action1ed object2 to recipient3
Action1(agent1,object2,recipient3), Action2(agent4,object1)
23. Object1 that action1ed object2 to recipient3 was action2ed by agent4
Action1(agent1,object2,recipient3), Action2(agent4,object1)
24. Agent2 that was action1ed by agent1 to recipient3 action2ed object4
Action1(agent1,object2,recipient3), Action2(agent2,object4)
25. Agent4 action2ed object2 that was action1ed by agent1 to recipient3
Action1(agent1,object2,recipient3), Action2(agent4,object2)
26. Agent1 that action1ed object2 action2ed object3 to recipient4
Action1(agent1,object2), Action2(agent1,object3,recipient4)

Dual event Conjoined
27. Agent1 action1ed object1 and object2. (Active conjoined object)
Action1(agent1, object1), Action1(agent1, object2)
28. Agent1 and agent3 action1ed object2. (Active conjoined agent)
Action1(agent1, object2), Action1(agent3, object2)
29. Agent1 action1ed object2 and action2ed object3. (Conjoined)
Action1(agent1, object2), Action2(agent1, object3)

Dual event Reflexive (Note that action1r corresponds to a reflexive action such as “see” or “think”.)
30. Agent1 action1r that agent2 action2ed object3. (Simple reflexive)
Action1r(agent1), Action2(agent2, object3)
31. Agent1 action1ed itself. (Simple active reflexive)
Action1(agent1, agent1)
32. Agent1 action1r that agent2 action2ed itself. (Reflexive simple noun phrase)
Action1r(agent1), Action2(agent2, agent2)
33. Agent1 action1r that agent2 action2ed it. (Pronoun simple noun phrase)
Action1r(agent1), Action2(agent2, agent1)
34. Agent1 action1r that it action1ed object2.
Action1r(agent1), Action2(agent1, object2)
35. Agent1 action1r that object3 was action2ed by agent2.
Action1r(agent1), Action2(agent2, object3)
36. Agent1 action1r that agent2 action2ed object3 to recipient4.
Action1r(agent1), Action2(agent2, object3, recipient4)
37. Agent1 action1r agent2 action2ed object3 to recipient4.
Action1r(agent1), Action2(agent2, object3, recipient4)
38. Object2 object3 were action1ed to recipient4 by agent1.
Action1(agent1, object2, recipient4), Action1(agent1, object3, recipient4)
First in, last out? The evolution of aphasic lexical speech automatisms to agrammatism and the evolution of human communication

Chris Code
University of Exeter / University of Sydney
Current work in the evolution of language and communication is emphasising a close relationship between the evolution of speech, language and gestural action. This paper briefly explores some evolutionary implications of a range of closely related impairments of speech, language and gesture arising from left frontal brain lesions. I discuss aphasic lexical speech automatisms (LSAs) and their resolution, with some recovery, into agrammatism with apraxia of speech, an impairment of speech planning and programming. I focus attention on the most common forms of LSAs, expletives and the pronoun+modal/aux subtype, and propose that further research into these phenomena can contribute to the debate. I briefly discuss recent studies of progressively degenerating neurological conditions resulting in progressive patterns of cognitive impairment that promise to provide insight into the evolutionary relationships between speech, language and gesture.

Keywords: evolution of language and speech, aphasic speech automatisms, recurrent utterances, agrammatism, expletives
1. Introduction

Beginning with the pioneering work of Broca, Wernicke, Hughlings Jackson and others in the latter part of the 19th century, the study of brain damage formed the basis for most of our knowledge of the relationships between brain structure and cognitive function. This ‘lesion model’ approach has produced a rich literature (see Code, Wallesch, Joanette, & Lecours, 1996, 2003), and in tandem with developments in cognitive neuroscience, continues to make a significant contribution. In
this paper I examine some limited aspects of aphasic symptomatology from the perspective that they may represent fossilised clues to the emergence of human language. My aim in this paper is to briefly explore some evolutionary implications of a limited range of closely related neuropsychological impairments of language and speech-action production arising from left frontal brain lesions, and to sketch some possible relations between them. I examine aphasic lexical speech automatisms (LSAs), which commonly occur in aphasia from frontal lesions and often evolve into agrammatism (the term used to describe impairments in the use of syntax accompanied by impairments of articulatory implementation), and explore some evolutionary implications of this relationship. My focus will be on relationships between speech, syntax and gesture, and therefore less on a separate treatment of phonetic, phonological, syntactic and gestural components. Aphasia is the generic term we use to describe a range of impairments to language use following brain damage. Some use the term to describe most impairments to any aspect of language use, including right hemisphere language impairments and apraxia of speech, and others emphasise the interaction of language processing with other aspects of cognition, most notably movement, perception and memory processing. Still others prefer to reserve the term for describing impairments to combinatory aspects of language processing — syntax, phonology, morphology, lexical semantics. There are acknowledged problems with large group research based on standardised typologies of aphasia (Basso, Lecours, Moraschini & Vanier, 1985; Dronkers, 2000), where up to 30% of aphasic speakers classified by type fail to show lesions in the predicted areas of the brain (i.e., a lesion in Broca’s area without a Broca’s aphasia, and vice versa). Much of the recent research into the cognitive neuroscience of language has been informed by models developed from detailed investigations of single case series. Before proceeding, I outline some contemporary theoretical approaches to the emergence in evolution of human communication that form the basic theoretical foundations for this paper. Clearly there is a major supposition in language evolution research that normal development of language and communication in humans and other primates can serve as a model to generate hypotheses and develop theory, and that ‘ontogeny recapitulates phylogeny’. My equally challengeable axiom is that impairments of speech, language and gesture following brain damage can provide additional insights and questions for research, and can converge with data from these other approaches. I accept as a starting point that modern human language differs significantly from animal communication in so far as it has developed a syntax that is generative and recursive (Hauser, Chomsky and Fitch, 2002). I leave to others the question of whether nonhuman primates possess in the wild, or are able to learn, develop or
use in captivity, a referential capacity and/or a capacity to propositionalise. I accept too that language and speech share cognitive and neural spaces with action/gesture and perception representations (e.g., Martin, Ungerleider & Haxby, 2000, for review), and an evolutionary relationship, but also leave for others discussion of opposing theories of whether language evolved first from manual gestures or first from animal vocalisation. This discussion can be followed elsewhere (Corballis, 2002; Arbib, 2005). I accept too that modern human language may have been preceded in its emergence by some form of protolanguage and protospeech. Protolanguage is seen as a stage preceding the development of full syntax (Bickerton, 1990). Bickerton (1990) argues that infant language, pidgin languages, and the language taught to apes in captivity are all protolanguages made up of utterances comprising a few words without syntactic structure beyond word order. Language is more than syntax, morphology and phonology, and I acknowledge with others that humans enjoy a capacity for formulaic language use that constitutes an essential feature of human language and has to be taken into account in any examination of the evolution of language and communication (Wray, 2002; Van Lancker Sidtis, 2004; Code, 1987). In this I, with others, disagree with the simple dismissal of formulaic language as ‘deeply, fundamentally, the wrong way to think about how human language works’ (Pinker, 1994, p. 93) and with the claim that ‘virtually every sentence that a person utters or understands is a brand-new combination of words, appearing for the first time in the history of the universe’ (Pinker, 1994, p. 22). This flies in the face of the fact that, depending on context, a significant amount of the language human beings produce is automatic, formulaic, over-learnt, routine, holistic and pre-packaged, and cannot simply be dismissed or ignored (Wray, 2002; Van Lancker Sidtis, 2004; Code, 1987).
2. Related frontal impairments of speech, language and action

Recent work in neuroscience (Corballis, 2002; Arbib, 2005; Martin, Ungerleider & Haxby, 2000), cognitive psychology (Glenberg & Kaschak, 2002, 2003) and comparative animal research (e.g., Tomasello & Call, 1997) emphasises the close relationships between speech, language and gesture/action, to the extent that language is claimed to be grounded in action and perception (Glenberg & Kaschak, 2002, 2003; Martin et al., 2000; Feldman & Narayanan, 2004), and the emergence of articulate speech, voice, action/gesture and lateralisation are seen as closely related in evolution (Corballis, 2003; Arbib, 2005; Greenfield, 1991; Kimura, 1976).
The pioneering linguist Roman Jakobson (1896–1982) was one of the first to apply linguistic principles to theoretical modelling in aphasia. A particular interest of his was in the relationships between developing child speech and the speech of people with various forms of aphasia, and he proposed a ‘regression hypothesis’, which holds that we can observe the same processes in both the child’s development of speech and in the impairments of aphasic speakers, but in reverse. ‘The dissolution of the linguistic sound system in aphasics provides an exact mirror-image of the phonological development in child language’ (Jakobson, 1968, p. 60) and ‘the order in which speech sounds are restored in the aphasic during the process of recovery corresponds directly to the development of child language’ (Jakobson, 1968, p. 64). Jakobson’s hypothesis has attracted little interest in neurolinguistics in recent years. MacNeilage (1998) has proposed the Frame/Content theory to explain the evolution of speech from the close-open cycle of the mandible that originally evolved for mammalian ingestive functions — chewing, sucking, licking. This basic mandibular cycle, which underlies the speech frame, was exapted and paired eventually with vocalization to become the basis for CV syllables. More recently, MacNeilage and Davis (2001) have confirmed the major similarities between the development of the CV syllabic characteristics of babbling in children and the nonlexical speech automatisms (NLSAs) made up of recurring CV syllable utterances (e.g., /tu tu/, /du du/) that are also commonly observed in severe ‘motor’ types of aphasia (Code, 1982a). In what follows I want to explore the idea that while the first utterances that were spoken may well have been basic CV syllables, as mirrored in babbling and in aphasic recurring CV automatisms, the evolution of lexical speech automatisms (LSAs) to agrammatism, also observed commonly in severe aphasia resulting from left frontal damage, may also provide useful insights into the early evolution of language.

2.1 Aphasic speech automatisms

LSAs are made up of real words, and their features are characterised best by models developed to account for formulaic language use. The major properties of formulaic utterances are summarised in Table 1 (based on Van Lancker Sidtis, 2004). These are the features that separate formulaic language from novel, componentially generated and fully propositional language. Formulaic language such as clichés, over-used social greetings, expletives, stock phrases, and automatic speech like counting and serial recitation of rhymes, exhibits the features listed in Table 1.
Table 1. The main properties of formulaic language (based on Van Lancker Sidtis, 2004) and some features of aphasic speech automatisms discussed later in the text.

Property of Formulaic Language                        LSA   NLSA
Stereotyped form                                       +     +
Conventionalised meaning                               –     –
Associated with social context                         –     –
Inclusion of attitudinal and affective valence         +     +
Familiarity — recognised by other native speakers      +     –
It was Hughlings Jackson’s (1874, 1879) original observations of aphasic speech automatisms in the later nineteenth century that led him to propose the idea of propositionality in language. Nonpropositional speech is produced automatically, and the individual syntactic, morphological and phonological elements are not newly or individually generated. It includes cursing and swearing, automatic serial verbal activities like automatic counting and rhyme, prayer and arithmetic table recitation. He distinguished this from propositional speech, where original ideas are being encoded into novel utterances. Jackson (1874) also introduced the idea that the left hemisphere was responsible for processing propositional language, whereas both right and left were involved in the processing of nonpropositional language. Between 15% and 30% of spoken language may be predominantly formulaic in certain social contexts, like when using the telephone (Van Lancker Sidtis, 2004). It would be a mistake to view such language as devoid of meaning and significance; even automatically produced expressions of pain and emotion are high in expressive significance. In addition, formulaic communications have a major pragmatic and social function (Wray, 2002). Although relatively automatically produced, therefore, the degree of ‘propositionality’ inherent in automatically produced formulaic language is highly variable and situation-specific. I ask the reader to consider the proposition, along with Wray (1998), that formulaic language functions today more or less as it did during a protolanguage stage in evolution. The interested reader can refer to recent discussions on the place of formulaic language in contemporary language use and in evolution in Wray (2002) and Van Lancker Sidtis (2004). LSAs are made up of the same recognisable high frequency words produced every time, or almost every time, speech is attempted. They are common in speakers with a severe Broca’s pattern of aphasia with significant speech initiation impairments (although the LSA itself is produced relatively fluently). Cases have been described with some retained expressive writing abilities (Blanken, De Langen, Dittmann & Wallesch, 1989). LSAs are stereotyped and unchanging utterances, invariantly produced, being phonologically, syntactically and semantically
Table 2. Some examples of LSAs classified into subtypes, taken from an analysis of 78 LSAs (Code, 1982a).

Pronoun+Modal/Aux Verb (N=14): I want… (N=3), I can/can’t… (N=5), You can’t…, I try…, I think…
Expletives (N=11): fucking hell, fuck off, fuck fuck fuck, (fuck) off, bloody hell, Oh you bugger, Bugger
Proper Names (N=5): Bill, Billy, John, Parrot (proper name), Percy’s died, BBC
Yes/No (N=4): yes yes yes, no (N=2), yep
identical each time they are produced. Suprasegmentals, like intonation and stress, can vary considerably (De Bleser & Poeck, 1985; Oelschlaeger & Damico, 1998), although the range and pattern of intonation production too may be limited for many speakers. The overwhelming majority of LSAs (the exception being proper names) appear to have no referential, contextual or intentional connection with the speaker’s world. An LSA is triggered by intentional expression or internal states, even if its surface structure has no semantic relationship to intention, meaning and the individual’s internal state. Sometimes speakers have more than one automatism, a second or third emerging after the first. These subsequent ones are usually linguistically related (e.g., 1st: so so, 2nd: better better; 1st: I can talk, 2nd: I can try) (Code, 1982a). Many speakers with aphasic speech automatisms may be unaware of the inappropriateness of the utterance (Alajouanine, 1956), although this has not been systematically investigated. In contrast, nonlexical speech automatisms (NLSAs) do not make up recognisable words and consist predominantly of high frequency, motorically ‘easier’ reiterated consonant+vowel syllables. They are not arbitrary combinations of phones but adhere to the phonotactic constraints of the speaker’s native language. These too are unchanging, although a speaker may have more than one; subsequently emerging NLSAs are also usually phonologically similar (e.g., 1st: tu tu, 2nd: du du) (Code, 1982a). Features common to both include their holistically produced, pre-packaged nature. The phones which make up both types are high frequency, unmarked, motorically easier articulations; coronals (e.g. [d,n,t]) predominate, and there is a double dissociation between them in the sense that the two forms rarely co-occur in an individual aphasic speaker.
Speakers with CV syllable NLSAs are very severely aphasic and recovery to another aphasic stage is rare, although there have been few systematic studies (Alajouanine, 1956; Code, 1982a). This might suggest that while NLSAs provide a possible basis for comparison with babbling and the evolution of syllabic structure, they are less useful as a model for getting to subsequent stages in the evolution of language. LSAs, in contrast, often evolve to agrammatism, and analyses suggest that most look like the formulaic constructions that make up so much normal modern human language (Code, 1994). The semantic range of LSAs is narrow. There are personal names (known to the aphasic speaker), counting sequences, yes/no utterances and (the largest groups) expletives and modal/auxiliary structures, and these utterances might represent fossilised clues to the origins of human communication. Table 1 suggests that while both LSAs and NLSAs have a stereotyped form and can be produced with significant attitudinal and/or affective valence, their linguistic surface structure carries no conventionalised meaning and the utterance has no semantic connection to the situational context. While the ‘meaning’ of an NLSA is not recognised by native speakers, as it has no surface lexical structure, LSAs do have a surface lexical-semantic structure. But LSAs do not function as formulaic ‘communication’ in the normal sense; they are invariantly produced in place of appropriate language. However, as noted, aphasic speakers with either form of utterance can signal different social and expressive meanings through their use of accompanying intonation. But the surface structure carries no meaning. It is particularly important to distinguish between the initial production of a speech automatism (the first time the speaker produces the utterance following their neural incident) and subsequent productions of the same utterance when the speaker intends to produce something else. Some authors have made claims about the origins of speech automatisms, speculating on some relationship between something being said, or about to be said, at the moment of the cerebral incident and the resulting LSA. The evidence for this claim is weak and mostly anecdotal. The very narrow semantic range of most LSAs would suggest that the speakers were all contemplating similar utterances at the time of their individual strokes, an unlikely scenario, despite suggestions to the contrary (see Code, 1982b, for alternative explanations). It seems unlikely, therefore, that most of these utterances were intended even the first time the individual uttered them. We consider neural explanations for the origins of LSAs in the next section. Speakers with LSAs can make significant recovery, usually to nonfluent agrammatism, whereas speakers with NLSAs are more severely impaired and many may not be aware of their utterance (Alajouanine, 1956). The NLSA is perhaps the most primitive phonetically governed utterance a modern adult human can produce,
representing a very early evolutionary speech capability. I have suggested elsewhere (Code, 1994) that ‘much nonpropositional language may therefore be seen as evolutionary pre-linguistic. It is concerned with social and emotional aspects of communication and expression which pre-exist the capacity in human beings to generate fully predicative propositional language.’ For the most severely aphasic speakers who have them, such utterances are commonly all the speech they have. It’s a long way from Broca’s first case Leborgne, with the NLSA ‘Tan’ (/tã tã/), to the language of Shakespeare, Goethe, or Dylan (Thomas or Bob).

2.1.1.1 Neurogenesis of aphasic speech automatisms

This is no place to review yet again the extent and nature of right hemisphere language and speech. The interested reader is referred to Code (1987) and Joanette, Goulet and Hannequin (1990) for reviews of the role of the right hemisphere in language in general, and to Code (1996, 1997) and Code and Joanette (2003) for discussion of the right hemisphere’s speech production capability. Extensive evidence has accumulated in recent years that supports Jackson’s notion that the right hemisphere has at least some involvement in the processing of automatic and nonpropositional/formulaic language. Prosody, emotional language, automatic language, idioms, metaphors, and other complex features that do not engage combinatorial linguistic processes appear to be processed with significant right hemisphere involvement. Van Lancker and Cummings (1999) have comprehensively reviewed the neural origins of expletives in pathological language. ‘Naturally’ occurring expletives appear to emerge from ancient areas of the limbic lobe (a lobe originally named by Broca) and the basal ganglia (see also Lamendella, 1977; Leckman, Knorr, Rasmussen & Cohen, 1991; Code, 1987; Speedie, Wertman, T’air & Heilman, 1993) as complete packages and do not engage linguistic processes. They appear to emerge in pathology from disinhibited limbic structures, a system normally under more control from basal ganglia-prefrontal networks. Emotionally charged communication, like laughter and crying, the communication shared between lovers and between a mother and her baby, appears to be mediated with significant limbic involvement (MacLean, 1987). As part of the limbic system, the anterior cingulate, with its role in vocal initiation and emotion (Benga, this issue), and its close approximation and connection to the supplementary motor area, takes a special role. A striking feature of Tourette’s syndrome, a disorder resulting from basal ganglia-limbic connection dysfunction, is coprolalia, the involuntary production of obscene speech (examples being cunt, fuck). The speaker with Tourette’s appears to have no or minimal control over these emissions. Basal ganglia damage appears to be essential for the production of an aphasic speech automatism. Brunner, Kornhuber, Seemuller,
Suger and Wallesch (1982) analysed the CT scans of 26 patients, 12 of whom had either an NLSA or an LSA. All 12 had basal ganglia damage; neither type of utterance occurred in patients without basal ganglia damage, and automatisms did not occur in patients who had only subcortical (including basal ganglia) damage. Code (2005) has recently reviewed the contribution of the supplementary motor area (SMA) to syllable frame production from the perspective of brain damage, and MacNeilage and Davis (2005) have recently summarised the electrical stimulation work demonstrating that stimulation of the SMA causes an evocation and, importantly, repetition of basic meaningless CV forms, something that does not happen with stimulation of other neural sites, including Broca’s area. Abry et al. (2002) propose that it is the SMA that produces the utterance, suggesting the term frame aphasia for impairments resulting in NLSA, and supporting MacNeilage’s claim that the SMA is responsible for producing the syllabic frame of speech while the inferior frontal area supplies its syllabic content. On this account the production of a syllabic frame is one of the things that the SMA does, and in NLSA the mechanism involved in inhibiting the repetition of the same frame, and in ‘moving on’ to the next frame, is damaged or disconnected from the SMA. Action schemata (Bradshaw, 2001) are represented in the frontal action system, entailing input from the three basic cognitive functions of short-term motor memory, perceptual memory and inhibitory control. Separate internally generated and externally triggered action systems have been identified (Jahanshahi & Frith, 1998; Bradshaw, 2001). For internally generated, voluntary and self-initiated actions the dorsolateral prefrontal cortex, anterior cingulate, SMA, the putamen in the basal ganglia, and thalamus are engaged, whereas for externally triggered actions the lateral premotor and inferior parietal areas combine with anterior cingulate, but the SMA is not involved. There would appear to be an SMA-anterior cingulate-basal system responsible for initiation and ‘moving on’ for speech and voice, and an inferior frontal-Broca’s-operculum-premotor-basal system with parietal input responsible for syntax and sequential gestural communication (Corballis, 2002; Arbib, 2005). LSAs and NLSAs may represent, therefore, a pre-Broca’s stage in evolution. LSAs appear to come from frontal right hemispheric systems and NLSAs from the left SMA deprived of inhibition (Abry et al., 2002; Code, 1994, 2005). My assumption is that much of protolanguage was non-lateralised, and that the protolanguage stage in evolution took place before the establishment of lateralisation of function. Current views (Corballis, 2003) suggest that speech, gesture, dominant handedness and lateralisation are inseparably linked and, perhaps, approximately co-evolved. While there are clear and present dangers in lumping all formulaic communication into the same general basket, the evidence suggests that those aspects of current human language use which, it is suggested, may represent fossils
of an evolutionary protolanguage have significant right hemisphere representation (Code, 1987; Van Lancker-Sidtis, 2004; Wray, 2002). Formulaic language never got lateralised. It is propositional speech production that did, along with action and gesture control, supporting a putative sequential motor grounding for the development of a protosyntax leading to the emergence of narrow syntax with recursion. The evolution of an LSA to agrammatism for some, and the ‘nonfluent’ nature of agrammatism, would support this view. The high co-occurrence of limb and orofacial apraxia (impairments in voluntary use of actions, tools and gesture) with aphasia (see Code, 1998; Duffy, 2006) underscores the link between speech, action and syntax. But what evidence is there that lexical speech automatisms might be fossilised formulaic components of protolanguage? Could they represent some of the earliest utterances that humans produced? If they do, how might they have functioned communicatively? Maybe as such nonpropositional utterances do now. In their pathological form LSAs carry no meaning, whereas in evolution such utterances would have carried meaning. LSAs are invariantly produced, as such utterances would have been in their evolutionary protolanguage manifestation. Such utterances in evolution would have communicated internal states and speech acts, which may have functioned to manipulate others and to build and maintain social relationships (Wray, 2000). In evolution, too, such utterances would have been used with intentionality. Below we look at the main subtypes, expletives and pronoun+modal/aux, in a little more detail.

2.1.1 Expletives

The use of expletives and dirty words for cursing and swearing is universal, is clearly differentially taboo in different cultures (see Leach, 1966), appears early in children, and is more common in males (Jay, 1980, 1995; Van Lancker & Cummings, 1999): all the expletive LSAs in Code (1982a) were produced by men. They are used negatively (hatred and racism) and positively (humour and sex). Jay (1995) provides an example of the use of expletives in humour that nicely captures a common perception of the New Yorker and is worth repeating: ‘How many New Yorkers does it take to change a light bulb? Answer, None of your fucking business.’ Expletives have a range of functions in normal language use. These expressions can emerge reactively, following a crack on the shin from the coffee table or the realisation that one has just missed the train, and are often intermingled with expressions of pain and frustration. There is a clear functional difference, for instance, between the utterance of fuck as an expression of pain and anger following a crack on the shin and fuck used under the sheets as dirty talk, although both events involve powerful emotional forces. They can function as fundamental expressions
of deep emotional feeling (Jay, 1995), or just habitually, where terms are used in a fairly innocent adjectival manner, usually for emphasis; e.g., The bloody car was going fucking fast. The aphasic speaker with an expletive LSA does not intend the utterance in either of these senses. Expletives take a direct route, by-passing morphophonological and syntactic processes, and in pre-language evolution they would not have been used recursively. Anthropological perspectives (e.g., Leach, 1966) highlight expletive use as taboo and tie it closely to the emergence of religion in human societies: expletives become ‘bad language’ in a religious context. Notwithstanding this, the use of expletives would predate the development of religious sensibilities in humans. Before this, bad language was not bad. However, the evolution of the ability to inhibit the use of expletives, consciously and neurally, something humans employ to varying degrees in different social and behavioural contexts, would appear to be linked to the development of religious taboo. Darwin points out that emotions expressed in animals are usually variants of lust and hostility (Darwin, 1872; Walker, 1987). Expletives may have been the first verbal threats and intimidations uttered by humans and may reflect the vocal elements that first occurred as emotional accompaniments to gestural communication (Corballis, 2003); for instance, the angry utterance of a threatening expletive to a potential competitor for a carcass. Expletives may have accompanied violent and sexual acts, as they can now, and blends of sex and violence. Expletives can be included in slots in normal sentence production, of course, taking verb, noun and adjectival forms. They may therefore be accessible by the left hemisphere lexicon or may have separate representation there (for discussion see Wray, 2002). The role, or failure, of inhibition in expletive LSAs is illustrated by the observation that they can occur in people who have not been in the habit of using expletives in their normal everyday speech (one person with the LSA fuck in the collection under discussion was a retired Christian minister). The LSA (fuck) off in Table 2 was originally fuck off: the offensive element was gradually removed through significant hard work by the speaker’s wife, leaving the speaker with what looks like a rare ‘function’ word in the collection. For these utterances used naturally and spontaneously, a strong emotional surge would appear to be essential. They can function to arouse a sexual partner, and to self-arouse. Dirty talk, the exchange of speech that accompanies sexual intercourse, may have been amongst the first verbalisations used by humans: sexually expressive utterances, perhaps predating protolanguage and with a minimal syllabic structure, may well have developed from primitive grunts.
The use of expletives in earnest strikes the ear as foul and base, especially when associated with rowdy or violent behaviour, often evoking feelings of fear and anger in the hearer. In natural use, therefore, expletives are powerfully produced and have powerful effects on others.

2.1.2 Pronoun+Modal/Aux

The pronoun+modal/aux verb subtype is the largest (N=14) in Table 2. These appear as half-formed or unfinished sentences, possibly because of lexical access problems at the first emission of the utterance. They contrast interestingly with features of agrammatism (discussed further below), into which they can evolve, as agrammatic speakers typically omit pronouns, modals and auxiliaries in their speech (Nespoulous, Code, Virbel & Lecours, 1998; Perlman Lorch, 1989). The apparent formulaicity of these automatisms arises from their frozen and invariant production in all contexts. It is possible that, just as people with aphasia and LSAs often evolve into severely agrammatic speakers (Alajouanine, 1956), so in evolution such utterances might have emerged during an intermediate stage: part of a protosyntactic stage (Jackendoff, 1999) between protolanguage and ‘narrow’ syntax in Hauser et al.’s (2002) sense. Modals and auxiliaries like I want, I can’t… may correspond to the stage where pronouns were coupled with verbs for the first time and may represent the first struggling steps in the emergence of syntax. Bickerton (1998) proposes that a theta-role structure would have been necessary to reliably sequence two or more words, which he sees as a possible link between protosyntax and narrow syntax. It is relevant to note that an over-dependence on modalising language is characteristic of most forms of aphasia where referential language is impaired or absent, whereas the reverse appears to be the case in agrammatism (Nespoulous et al., 1998). A small recent study of just 20 LSAs in Cantonese speakers suggests that this subtype is not a universal (this structure not being present in natural Cantonese speech), while the other subtypes of LSAs found in European languages were also found in Cantonese (Chung, Code & Ball, 2004). The first person pronoun I was the most common word, occurring 13 times in 78 words (Code, 1982a). A number of speakers had identical LSAs. There were 3 examples of I want… and 5 examples of I can… or I can’t… The probability of this occurring by chance is tiny, even though the expressions are relatively high frequency in spoken language, and this serves to emphasise the reduced ‘semantic’ range available to the speaker. It is also a major argument against the idea that LSAs can be traced back to something being said, or about to be said, at the moment of the cerebral incident (see Code, 1982b, for discussion).
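A minimal back-of-the-envelope sketch may make the force of this chance argument concrete; the base rate p used below is a purely illustrative assumption, not an estimate from any spoken-language corpus. If each of the N = 78 automatisms in the collection independently took a particular lexical form with probability p, the chance of five or more identical cases would be the binomial tail

\[ P(X \ge 5) = \sum_{k=5}^{N} \binom{N}{k} \, p^{k} (1-p)^{N-k}, \]

which for an illustrative p = 0.01 (giving \(\lambda = Np = 0.78\)) is approximately 0.001 by the Poisson approximation; and the collection contains several such repeated forms, not just one.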
Does an account of a possible role for speech automatisms in the evolution of communication converge with data from normal development? Pre-linguistic infants produce emotional utterances, expressions of pain, anger and disappointment, very early. They later go on to develop a stereotypic use of language to manipulate their environment — ‘I want…’, ‘I can’t’. The development of pronouns in children is well documented. Clark and Clark (1977) suggest that they emerge in development in the order: 1st, personal pronouns ‘I, me, my, mine’; 2nd, ‘you, your, yours’; 3rd, ‘he, him, his, she, her, hers, it, its’, followed by plurals ‘they, them, theirs’. Brown’s (1973) seminal study suggests that the first two stages described by Clark and Clark occur between 1;6 and 2;0 years. The link between the emergence of ‘I’ and ‘me’ and the concept of self has been made many times since William James (1890) and is illustrated by the notion of ‘I’-Thoughts (Bermudez, 1998), the ability to entertain thoughts about oneself. The 2-year-old child often confuses pronouns, indicating a lack of integrity between different self-concepts (Piaget, 1954). The use of the 1st person pronoun, therefore, requires some sense of conscious self-awareness. Correlated with strengthened connections between the cortex and limbic system in the first year of life is the emergence of universal fears of strangers and of separation from the carer. Additional neural strengthening in the second year correlates with the emergence of self-awareness and the eruption of language (Herschkowitz, 2000). If the emergence of ‘I’ in infants is taken to reflect a stage where the infant’s self-awareness and consciousness are developing, the emergence of ‘I’ in evolution may have indicated the same — as humans began to be able to acknowledge them ‘selves’, they developed, probably in tandem, words to denote themselves. With, or following closely on, a sense of self must come a sense of others (‘you’, and maybe later ‘us’ and ‘they’). A sense of conscious self would therefore have predated a ‘theory of mind’, which might be conceived as necessary for the later application of personal names. So maybe conscious self-awareness developed in humans before full syntax. The view that LSA speakers are unaware of their speech has never, to my knowledge, been tested, and clinical experience suggests that some may be more aware than others. Level of awareness may be a prognostic indicator. However, recent studies show that severely agrammatic and globally aphasic people still have self-awareness and self-consciousness and are able to complete theory of mind tasks (Varley & Siegal, 2000; Varley, Siegal & Want, 2001), challenging the idea that consciousness is dependent on language. Wray (1998) has recently suggested a different view of the emergence of pronouns during the protolanguage stage, functioning as general-purpose referents for humans living in the here and now (e.g., give her the meat; give her the stone). Questioning whether personal names existed at the same stage, she extends the
idea to suggest that there may have been a stage where even more general-purpose pronouns like this or that covered the entire class of objects in the environment; something like give her that and give her this would achieve a desired effect without the need for further qualification.
3. Discussion: From automatisms to syntax in a frontal speech-action system

Research in recent years has added to the classical language areas, and most of us now have a broader appreciation of the nature of language, one that does not fit a narrow left-cortical-perisylvian model. The right hemisphere, basal ganglia, thalamus and limbic structures play significant parts in the processing of formulaic, nonpropositional, metaphorical, idiomatic and pragmatic aspects of language. Basal-limbic structures are phylogenetically old (Bradshaw, 2001; Van Lancker & Cummings, 1999; Walker, 1987), and the aspects of human communication associated with them would appear to be old too. Wray (2000) suggests a process whereby protolanguage was made up of ‘phonetically arbitrary’, holistically produced, formulaic CV sequences, such as /mupati/, to mean give that to me. These functioned as interpersonal manipulations and expressions of group and personal identity. She emphasises that ‘the whole thing means the whole thing’; they are holistic structures. She goes on to argue that this provided an eventual basis for segmentation into separate ‘words’, based on Bickerton’s (1998) idea that a theta-role module developed following a single word stage to keep track of relations between agents and actors and events. As noted above, this may have become the basis for the emergence of syntax. However, a phonetically much simpler stage must have preceded this sophisticated concatenation of speech sounds. The motoric and phonoarticulatory constraints of an earlier stage in the development of protospeech would suggest first a single CV utterance stage, perhaps used to manipulate and express identity at an earlier and more fundamental level, much closer to primate communication and before control mechanisms centred on basal-cingulate-SMA development — an angry extended or repeated CV coupled with an outstretched limb to signal give that to me, or the ‘coo’ /kuku/ combined with a hug and grooming to express group cohesion or cement relationships. It seems possible that sounds accompanying sexual behaviour evolved into the dirty talk we now enjoy by first combining a consonantal element with a prolonged vocal component, perhaps repeatedly — CVCVCV. Repetition, a central feature of automatism production (Abry et al., 2002), then acts to cement together separate single CV syllable utterances
into a CVCV… utterance. The observation that the most common sexually connotative terms (Jay, 1995) — the ‘four letter’ words — are CVC in English suggests that most expletives are phonetically relatively simple. The combination of young humans entering puberty with still developing nervous systems under hormonal influence might well have seen the emergence of some of the first word-like utterances. (Swearing in modern adolescence is high [Jay, 1982].) The expletive LSA in someone with major neural damage and global aphasia supports the notion that expletives have deep representation in the brain outside the perisylvian language area, so a capacity to express expletives would appear to be ancient. I am suggesting that early CV repetition from an SMA-cingulate connection could have merged with strong emotional expressions associated with sex, love, anger and fear. Base and primitive language from the more primitive areas of the brain could be a candidate for a stage between primate communication and protolanguage and may have seen the emergence of early words. Getting to a putative protosyntax stage from expletive use may well lead to a dead end (and Wray’s [2002, p. 249] model of the relationships between formulaic and non-formulaic language use suggests a separate route for expletives without access to grammatical processes), but I am unaware of any investigations of the evolution of expletive LSAs to other symptoms that could throw light on the question. We have touched on agrammatism, the range of impairments that can occur in the use of grammar, coupled with nonfluency caused by deficits in articulatory implementation that are subsumed under the term ‘apraxia of speech’. The apparent relationship between speech, gesture and syntax suggested by the high co-occurrence of aphasia, apraxia of speech and other apraxias already noted (for review see Code, 1998; Duffy, 2006) would tend to support theories that claim a motor/perceptual grounding for language representation (Feldman & Narayanan, 2004; Glenberg & Kaschak, 2002, 2003; Martin et al., 2000), and converges with evidence for a common genetic base for motor immaturity and specific language impairment in children (Bishop, 2002) and for evolutionary bonds between them (e.g., Arbib, 2005; Corballis, 2002; Greenfield, 1991). Despite significant neurolinguistic interest and a considerable research database (e.g., Caplan, 1987; Grodzinsky, 2000), few evolutionary linguists have seen agrammatism as worthy of more intensive investigation. This may be for a variety of reasons, not least the variable presentation of agrammatism (e.g., Caplan, 1987; Webster, Franklin & Howard, 2004) and its diverse manifestation in different languages. We use the term to describe a variety of impairments where (in English at least) there may be a paucity or absence of function words in contrast to content words, omission of auxiliary verbs (unlike in LSAs), impaired inflection on verbs, impairments in syntactic comprehension, nominalisation of verbs,
impaired theta-role assignment and impaired mapping from semantics to syntax and tense marking, and many see it as a central syntactic disorder (for review see Caplan, 1987; Grodzinsky, 2000; De Bleser, Bayer & Luzzatti, 1996; Perlman Lorch, 1989). Agrammatism is unlikely to be a unitary syntactic disorder, and a range of alternative underlying impairments has been posited, including impaired working memory, verb and other lexical access deficits, theta-role assignment impairments, systemic adaptation to impaired mechanisms, and economy of effort, among others (see Perlman Lorch, 1989, for a review of the contenders). But we know that forms of agrammatism can evolve with recovery from LSAs and from severe forms of apraxia of speech. It is also clear that agrammatism itself can evolve with time, the pattern of syntactic deficit changing with recovery from severe to milder forms (e.g., Guasti & Luzzatti, 2002), with features at more chronic stages more likely reflecting systemic adaptation and compensation. A ‘motor’ element to agrammatism has been directly claimed or implied by a number of accounts (e.g., Isserlin, 1922; Lenneberg, 1975; Goodglass, 1976), supporting theories that claim a gestural basis for syntax (e.g., Armstrong, Stokoe & Wilcox, 1994). It may therefore be profitable for future research to examine more closely the evolving relationships between agrammatism and apraxias following brain damage. The close relationships that appear to exist between acquired agrammatism and speech apraxia converge well with recent investigations of the KE family, some members of which share a developmental and inherited impairment of speech and facial praxis, syntactic processing and more general language abilities, apparently due to damaged expression of the gene FOXP2 in those family members with the condition, but not in those without it (see Marcus & Fisher, 2003, and Corballis, 2004, for recent relevant discussion). Identifying FOXP2 as ‘the gene for language’ is clearly beyond the available data, but the behavioural investigations have identified facial and speech apraxia as core elements of the condition. The implications for a close genetic and evolutionary relationship between facial action, speech action and syntax would appear to be tantalisingly clear (Corballis, 2003). Most lesion studies have been of people who have had a stroke caused by cerebrovascular impairments of blood supply to the brain (the main population) or who have experienced a traumatic brain injury. These acquired forms of brain damage are said to cross functional systems within the brain. More recently, studies of the impairment of cognitive processes associated with progressing neurological disease, affecting relatively, and progressively, circumscribed aspects of cognitive functioning, have increased (Croot, Patterson & Hodges, 1998; Garrard & Hodges, 1999; Mesulam, 1982; Harasty, Halliday, Kril, & Code, 1999). Progressive neural degeneration is said to follow more vertically represented functional
systems, which implies that it may follow the ontogenetic and phylogenetic development of neural systems. This promises a new appreciation of the relationships between different modular architectures of cognitive systems and their neural representation. The study of the patterns of functional impairment from degenerative lesions should therefore converge with other research on the evolution of functions. Indeed, the question of the relationship between speech and gesture in evolution may be illuminated through the study of relationships between aphasia and apraxia in progressive conditions. There have been recent studies of speech, language and gestural impairments that may represent progressive damage to more vertically, and phylogenetically, delimited functional systems. A range of different kinds of degenerative speech planning, programming, co-ordination and execution impairments arising from frontal damage has been described, besides the ‘apraxia of speech’ that recent models have restricted to the four ‘kernel’ characteristics of sound distortions, prolonged segment durations, prolonged intersegment durations and disturbed prosody (McNeil, Robin & Schmidt, 1997; McNeil, Doyle & Wambaugh, 2000; McNeil, Pratt & Fossett, 2004; Van der Merwe, 1997). These writers restrict apraxia of speech to impairment in the planning stage of speech production. Longitudinal investigation of these impairments could contribute significantly to understanding the evolution of speech. A small number of case studies in which degenerating speech combined with limb apraxic impairments has been mapped have been described in recent years. We are currently working with a man with a progressive degenerative condition who, over a ten-year period, developed a unique and relatively rare progressive speech production impairment degenerating to virtual mutism. Degenerating alongside are orofacial and gestural movements, producing limb and orofacial apraxias. Detailed longitudinal testing over this period shows that his language system and general cognition were relatively unimpaired. More recently, however, he has developed agrammatic agraphia. Current analyses of our participant’s progressing disability are examining the temporal relations emerging between speech impairment, apraxias and agrammatism, and comparing these with the neural degeneration. Such studies could make a further contribution to the question of the evolutionary relationship between speech, action, gesture and syntax. Investigation of the evolution of language is rather like the artist’s subjective impressions of the in camera personalities and events in a courtroom where ‘objective’ recording on film is not permitted. I have attempted to outline a possible contribution to understanding the evolution of human language from an examination of aphasic phenomena, most specifically a limited subset of LSAs, and their relation to other speech, syntax and action-related impairments arising from frontal damage. This artist’s impression could well be yet another half-baked set of wild speculations
on the origins of language, and there are certainly gaps in the arguments I have sketched. Where I accept that I may have speculated wildly is in the possible emergent temporo-sequential relationship between different subtypes of LSA in evolution, hypothesising that different subtypes may represent some kind of staging of development from single repeated CV expletives and syntactically primitive pronoun+modal/aux constructions, forming a protosyntax stage, to agrammatism, bridging a gap between protolanguage and full syntax. But I have tried to ground my arguments in current models of protolanguage and formulaic language use, and have attempted to show that the origins of LSAs are embedded in contemporary communication and that their neurogenic origins lie in older neural systems that do not figure in the classical language areas of the brain. Whatever the case, I have suggested that the fragments of language remaining to people with aphasia, of this type at least, might constitute fossils holding clues to the origins of human language, and that they are worthy of further investigation by evolutionary linguists and psychologists.
Acknowledgements

I am grateful to Alison Wray and Michael Arbib for generous feedback on an earlier draft and to the Hanse Institute for Advanced Study, Delmenhorst, Germany, where I was a Fellow during the writing of parts of this paper.
References

Abry, C., Stefanuto, M., Vilain, A., & Laboissière, R. (2002). What can the utterance “tan, tan” of Broca’s patient Leborgne tell us about the hypothesis of an emergent “babble-syllable” downloaded by SMA? In J. Durand & B. Laks (Eds.), Phonetics, phonology, and cognition (pp. 226–243). Oxford: Oxford University Press.
Alajouanine, T. (1956). Verbal realization in aphasia. Brain, 79, 1–28.
Arbib, M. A. (2005). From monkey-like action recognition to human language: An evolutionary framework for neurolinguistics. Behavioral and Brain Sciences, 28, 105–167.
Armstrong, D. F., Stokoe, W. C., & Wilcox, S. E. (1994). Signs of the origin of syntax. Current Anthropology, 35, 349–368.
Basso, A., Lecours, A.-R., Moraschini, S., & Vanier, M. (1985). Anatomoclinical correlations of the aphasias as defined through computerized tomography: Exceptions. Brain & Language, 26, 201–209.
Bermudez, J. L. (1998). The paradox of self-consciousness. Cambridge, MA: The MIT Press.
Bickerton, D. (1990). Language and species. Chicago: University of Chicago Press.
Bickerton, D. (1998). Catastrophic evolution: The case for a single step from protolanguage to full human language. In C. Knight, M. Studdert-Kennedy, & J. A. Hurford (Eds.), The evolution of language. Cambridge: Cambridge University Press.
Bishop, D. V. M. (2002). Motor immaturity and specific language impairment: Evidence for a common genetic basis. American Journal of Medical Genetics: Neuropsychiatric Genetics, 114, 56–63.
Blanken, G., De Langen, E. G., Dittmann, J., & Wallesch, C.-W. (1989). Implications of preserved written language abilities for the functional basis of speech automatisms (recurring utterances): A single case study. Cognitive Neuropsychology, 6, 211–249.
Bradshaw, J. L. (2001). Developmental disorders of the frontostriatal system. Hove: Psychology Press.
Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.
Brunner, R. J., Kornhuber, H. H., Seemuller, E., Suger, G., & Wallesch, C.-W. (1982). Basal ganglia participation in language pathology. Brain & Language, 16, 281–299.
Caplan, D. (1987). Neurolinguistics and linguistic aphasiology. Cambridge: Cambridge University Press.
Chung, K. K. H., Code, C., & Ball, M. J. (2004). Speech automatisms and recurring utterances from aphasic Cantonese speakers. Multilingual Communication Disorders, 2, 32–42.
Clark, H. H., & Clark, E. V. (1977). Psychology and language. New York: Harcourt Brace Jovanovich.
Code, C. (1982a). Neurolinguistic analysis of recurrent utterances in aphasia. Cortex, 18, 141–152.
Code, C. (1982b). On the origins of recurrent utterances in aphasia. Cortex, 18, 161–164.
Code, C. (1987). Language, aphasia, and the right hemisphere. London: Wiley.
Code, C. (1994). Speech automatism production in aphasia. Journal of Neurolinguistics, 8, 135–148.
Code, C. (1996). Speech from the isolated right hemisphere? Left hemispherectomy cases EC and NL. In C. Code, C.-W. Wallesch, Y. Joanette, & A.-R. Lecours (Eds.), Classic cases in neuropsychology. Mahwah, NJ: Lawrence Erlbaum.
Code, C. (1997). Can the right hemisphere speak? Brain & Language, 57, 38–59.
Code, C. (1998). Models, theories and heuristics in apraxia of speech. Clinical Linguistics & Phonetics, 12, 47–65.
Code, C. (2005). Syllables in the brain: Evidence from brain damage. In R. J. Hartsuiker, R. Bastiaanse, A. Postma, & F. N. K. Wijnen (Eds.), Phonological encoding and monitoring in normal and pathological speech. Hove, Sussex: Psychology Press.
Code, C., & Joanette, Y. (2003). Neural plasticity in the control of language in the adult brain: The role of the separated right hemisphere in cases PS, VP and JW. In C. Code, C.-W. Wallesch, Y. Joanette, & A.-R. Lecours (Eds.), Classic cases in neuropsychology: Volume II. Hove, Sussex: Psychology Press.
Code, C., Tree, J., Dawe, K., Kay, J., Ball, M. J., & Edwards, M. (2003). Progressive cortical anarthria with apraxias. Paper presented at the ‘Vocalise to Localise’ Conference, University of Grenoble, January.
Code, C., Wallesch, C.-W., Joanette, Y., & Lecours, A.-R. (Eds.). (1996). Classic cases in neuropsychology: Volume I. Hove, Sussex: Lawrence Erlbaum Associates.
Code, C., Wallesch, C.-W., Joanette, Y., & Lecours, A.-R. (Eds.). (2003). Classic cases in neuropsychology: Volume II. Hove, Sussex: Taylor & Francis.
Corballis, M. C. (2002). From hand to mouth: The origins of language. Princeton: Princeton University Press.
Corballis, M. C. (2003). From mouth to hand: Gestures, speech, and the evolution of right-handedness. Behavioral & Brain Sciences, 26, 199–260.
Corballis, M. C. (2004). FOXP2 and the mirror system. Trends in Cognitive Sciences, 8, 95–96.
Croot, K., Patterson, K., & Hodges, J. R. (1998). Single word production in nonfluent progressive aphasia. Brain & Language, 61, 226–273.
Darwin, C. (1872). The expression of the emotions in man and animals. London: John Murray.
De Bleser, R., & Poeck, K. (1985). Analysis of prosody in the spontaneous speech of patients with CV-recurring utterances. Cortex, 21, 405–416.
De Bleser, R., Bayer, J., & Luzzatti, C. (1996). Linguistic theory and morphosyntactic impairments in German and Italian aphasics. Journal of Neurolinguistics, 9, 175–185.
Dronkers, N. (2000). The neural architecture of language disorders. In M. Gazzaniga (Ed.), The cognitive neurosciences (pp. 949–960). Cambridge, MA: MIT Press.
Duffy, J. R. (2006). Apraxia of speech in degenerative neurologic diseases. Aphasiology, 20, 511–527.
Feldman, J., & Narayanan, S. (2004). Embodied meaning in a neural theory of language. Brain & Language, 89, 385–392.
Garrard, P., & Hodges, J. R. (1999). Semantic dementia: Implications for the neural basis of language and meaning. Aphasiology, 13, 609–623.
Glenberg, A. M., & Kaschak, M. P. (2002). Grounding language in action. Psychonomic Bulletin & Review, 9, 558–565.
Glenberg, A. M., & Kaschak, M. P. (2003). The body’s contribution to language. Psychology of Learning and Motivation: Advances in Research & Theory, 43, 93–126.
Goodglass, H. (1976). Agrammatism. In H. Whitaker & H. A. Whitaker (Eds.), Studies in neurolinguistics, Vol. I. New York: Academic Press.
Greenfield, P. M. (1991). Language, tools and brain: The ontogeny and phylogeny of hierarchically organized sequential behavior. Behavioral & Brain Sciences, 14, 531–595.
Grodzinsky, Y. (2000). The neurology of syntax: Language use without Broca’s area. Behavioral and Brain Sciences, 23, 1–21.
Guasti, M. T., & Luzzatti, C. (2002). Syntactic breakdown and recovery of clausal structure in agrammatism. Brain & Cognition, 48, 385–391.
Harasty, J., Halliday, G. M., Kril, J., & Code, C. (1999). Specific temporoparietal gyral atrophy reflects the pattern of language dissolution in Alzheimer’s disease. Brain, 122, 675–686.
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298, 1569–1579.
Herschkowitz, N. (2000). Neurological bases of behavioural development in infancy. Brain & Development, 22, 411–416.
Isserlin, M. (1922). Über Agrammatismus. Zeitschrift für die gesamte Neurologie und Psychiatrie, 75, 332–410.
Jackendoff, R. (1999). Possible stages in the evolution of the language capacity. Trends in Cognitive Sciences, 3, 272–279.
Jackson, H. J. (1874). On the nature of the duality of the brain. In J. Taylor (Ed.), Selected writings of John Hughlings Jackson, Vol. II. London: Staples Press.
Jackson, H. J. (1879). On affections of speech from disease of the brain. In J. Taylor (Ed.), Selected writings of John Hughlings Jackson, Vol. II. London: Staples Press.
Jahanshahi, M., & Frith, C. D. (1998). Willed action and its impairments. Cognitive Neuropsychology, 15, 483–533.
Jakobson, R. (1968). Child language, aphasia and phonological universals. The Hague: Mouton.
James, W. (1890). Principles of psychology, Vol. 1. New York: Dover.
Jay, T. (1980). Sex roles and dirty word usage: A review of the literature and a reply to Haas. Psychological Bulletin, 88, 614–621.
Jay, T. (1995). Cursing: A damned persistent lexicon. In D. Herrmann, M. Johnson, C. McEvoy, C. Hertzog, & P. Hertel (Eds.), Basic and applied memory: Research on practical aspects of memory. Hillsdale, NJ: Erlbaum.
Joanette, Y., Goulet, P., & Hannequin, D. (1990). Right hemisphere and verbal communication. New York: Springer Verlag.
Kimura, D. (1976). The neural basis of language qua gesture. In H. Whitaker & H. A. Whitaker (Eds.), Studies in neurolinguistics, Vol. II. New York: Academic Press.
Lamendella, J. T. (1977). The limbic system in human communication. In H. Whitaker & H. A. Whitaker (Eds.), Studies in neurolinguistics, Vol. III (pp. 157–222). London: Academic Press.
Leach, E. (1966). Anthropological aspects of language: Animal categories and verbal abuse. In E. H. Lenneberg (Ed.), New directions in the study of language. Cambridge, MA: MIT Press.
Leckman, J. F., Knorr, A. M., Rasmussen, A. M., & Cohen, D. J. (1991). Basal ganglia research and Tourette’s syndrome. Trends in Neurosciences, 14, 94.
Lenneberg, E. (1975). In search of a dynamic theory of aphasia. In E. Lenneberg & E. Lenneberg (Eds.), Foundations of language development: A multidisciplinary approach, Vol. 2. New York: Academic Press.
MacLean, P. D. (1987). The midline frontolimbic cortex and the evolution of crying and laughter. In E. Perecman (Ed.), The frontal lobes revisited. New York: The IRBN Press.
McNeil, M. R., Robin, D. A., & Schmidt, R. A. (1997). Apraxia of speech: Definition, differentiation, and treatment. In M. R. McNeil (Ed.), Clinical management of sensorimotor speech disorders (pp. 311–344). New York, NY: Thieme.
McNeil, M. R., Doyle, P. J., & Wambaugh, J. (2000). Apraxia of speech: A treatable disorder of motor planning and programming. In S. E. Nadeau, L. J. Gonzalez Rothi, & B. Crosson (Eds.), Aphasia and language: Theory to practice (pp. 221–266). New York, NY: The Guilford Press.
McNeil, M. R., Pratt, S. R., & Fossett, T. R. D. (2004). The differential diagnosis of apraxia of speech. In B. Maassen, R. Kent, H. Peters, P. van Lieshout, & W. Hulstijn (Eds.), Speech motor control in normal and disordered speech. Oxford: Oxford University Press.
MacNeilage, P. F. (1998). The frame/content theory of evolution of speech production. Behavioral and Brain Sciences, 21, 499–546.
MacNeilage, P. F., & Davis, B. L. (2001). Motor mechanisms in speech ontogeny: Phylogenetic, neurobiological and linguistic implications. Current Opinion in Neurobiology, 11, 696–700.
MacNeilage, P. F., & Davis, B. L. (2005). A cognitive-motor syllable frame for speech production: Evidence from neuropathology. In W. J. Hardcastle & J. Mackenzie Beck (Eds.), A figure of speech: A Festschrift for John Laver. Mahwah, NJ: Lawrence Erlbaum Associates.
Marcus, G. F., & Fisher, S. E. (2003). FOXP2 in focus: What can genes tell us about speech and language? Trends in Cognitive Sciences, 7, 257–262.
Martin, A., Ungerleider, L. G., & Haxby, J. V. (2000). Category specificity and the brain: The sensory/motor model of semantic representations of objects. In M. Gazzaniga (Ed.), The cognitive neurosciences: Vol. II (pp. 1023–1036). Cambridge, MA: MIT Press.
Mesulam, M. M. (1982). Slowly progressive aphasia without generalized dementia. Annals of Neurology, 11, 592–598.
Nespoulous, J.-L., Code, C., Virbel, J., & Lecours, A.-R. (1998). Hypotheses on the dissociation between ‘referential’ and ‘modalizing’ verbal behaviour in aphasia. Applied Psycholinguistics, 19, 311–331.
Oelschlaeger, M. L., & Damico, J. S. (1998). Spontaneous verbal repetition: A social strategy in aphasic conversation. Aphasiology, 12, 971–988.
Perlman Lorch, M. (1989). Agrammatism and paragrammatism. In C. Code (Ed.), The characteristics of aphasia. Hove: Psychology Press.
Piaget, J. (1954). The construction of reality in the child (M. Cook, Trans.). New York: Basic Books.
Pinker, S. (1994). The language instinct. Harmondsworth: Penguin.
Speedie, L. J., Wertman, E., T’air, J., & Heilman, K. M. (1993). Disruption of automatic speech following a right basal ganglia lesion. Neurology, 43, 1768–1774.
Tomasello, M., & Call, J. (1997). Primate cognition. Oxford: Oxford University Press.
Van Lancker, D., & Cummings, J. L. (1999). Expletives: Neurolinguistic and neurobehavioral perspectives on swearing. Brain Research Reviews, 31, 83–104.
Van Lancker-Sidtis, D. (2004). When novel sentences spoken or heard for the first time in the history of the universe are not enough: Toward a dual-process model of language. International Journal of Language & Communication Disorders, 39, 1–44.
Van der Merwe, A. (1997). A theoretical framework for the characterization of pathological speech sensorimotor control. In M. R. McNeil (Ed.), Clinical management of sensorimotor speech disorders (pp. 1–25). New York: Thieme.
Varley, R., & Siegal, M. (2000). Evidence for cognition without grammar from causal reasoning and ‘theory of mind’ in an agrammatic aphasic patient. Current Biology, 10, 723–726.
Varley, R., Siegal, M., & Want, S. (2001). Severe impairment in grammar does not preclude theory of mind. Neurocase, 7, 489–493.
Walker, S. (1987). The evolution and dissolution of language. In A. W. Ellis (Ed.), Progress in the psychology of language: Vol. III. London: Lawrence Erlbaum Assoc.
Webster, J., Franklin, S., & Howard, D. (2004). Investigating the sub-processes involved in the production of thematic structure: An analysis of four people with aphasia. Aphasiology, 18, 47–68.
Wray, A. (1998). Protolanguage as a holistic system for social interaction. Language & Communication, 18, 47–67.
Wray, A. (2000). Holistic utterances in protolanguage: The link from primates to humans. In C. Knight, M. Studdert-Kennedy, & J. A. Hurford (Eds.), The evolution of language. Cambridge: Cambridge University Press.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
About the author

Chris Code is University Fellow in the School of Psychology, University of Exeter, and Foundation Professor of Communication Sciences & Disorders, University of Sydney. He qualified as a Speech and Language Pathologist and Therapist in 1972, completed his master’s in Phonetics and Linguistics at the University of Essex in 1976, and completed his PhD in Neuropsychology at the University of Cardiff, Wales, in 1983. He co-founded the journal Aphasiology and published
the monograph Language, Aphasia and the Right Hemisphere (John Wiley) in 1987. His research interests range from aphasia, the neural and cognitive representation of language, speech, action and calculation, and the evolution of language and speech, to psychosocial, clinical and theoretical explorations in aphasia.
Name index
A Abbs J. H. 146, 154 Abdullaev Y. G. 169, 173 Abry C. 1, 7, 8, 10, 11, 14, 27, 87, 97, 102, 105, 208, 211, 212, 224, 232–234, 237, 257, 269, 274, 278 Ackermann H. 175 Acredolo L. 183, 205 Adamson L. R. 70, 80, 82 Akre K. 52, 59, 63 Alajouanine T. 266, 267, 272, 278 Allen J. 174 Allman J. M. 162, 166, 171, 173, 176 Allopenna P. 257 Ameka F. 2, 10 Anderson J. R. 79, 85 Anderson-Wood L. 85 Andersson K. 51, 63 Andrews R. 65, 66 Apollonios Dyskolos 1 Arbib M. A. 29, 34, 41, 43, 44, 57, 65, 107, 110–112, 118, 128–131, 133–135, 139, 142, 146–154, 156, 161, 174, 177, 263, 269, 275, 278 Arcadi A. C. 41, 42 Ardila A. 111, 129 Armstrong D. F. 42, 48, 62, 276, 278 Armstrong E. 162, 163, 174 Arrouas Y. 237 Ash T. A. 104 Astafiev S. V. 8, 10 Atal B. S. 96, 102, 236 Auerbach E. J. 175
B Baciu M. 11, 257 Bacon S. J. 175 Badin P. 11, 87, 91, 98, 103, 208, 233, 234, 237 Bailey D. 256 Baillargeon R. 244, 257 Bailly G. 209, 233, 234 Baird G. 82 Bakeman R. 80, 82, 86 Baldwin D. A. 81, 82 Ball M. J. 272, 279 Barbas H. 167, 174 Bard K. A. 69, 70, 76, 80, 82–84 Barillas Y. 73, 85 Barney H. L. 91, 97, 101, 105 Baron-Cohen S. 69, 72, 73, 82, 171, 174 Basso A. 262, 278 Bates E. 57, 62, 67, 70, 72, 73, 78, 82, 189, 204, 206, 246, 253, 255 Baumann A 130 Bayer J. 276, 280 Beautemps D. 233, 237 Beckett C. 74, 85 Beecher M. D. 65, 86 Bekken K. E. 64 Belin P. 52, 62 Bell A. 134, 143, 154 Bell M. B. 28 Bellagamba F. 244, 255 Bellugi U 49, 63, 129, 157 Benga O. 127–129, 159, 172, 174, 177, 268 Benoît C. 233 Benson D. F. 111, 129
Benuzzi F. 42 Bergmann G. 75, 82 Bering J. 69, 85 Bermudez J. L. 273, 278 Berntson G. G. 62 Berrah A. R. 209, 232, 233 Berthommier F. 236 Bessière P. 207, 209, 217, 233, 235, 238 Beynon P 14, 27 Bickerton D. 3-6, 10, 110, 263, 272, 274, 278, 279 Biederman J. 174 Binkofski F. 42 Bishop D. V. M. 275, 279 Bishop M. 74, 80, 84 Bladon A. 212, 233 Blanc J. M. 243, 255, 256 Blanken G. 265, 279 Blaschke M. 79, 83 Bloom L. 4, 6, 10, 183, 204, 255, 257 Bobe L. 64 Bobholz J. A. 175 Boë L.–J 87, 93, 94, 96, 97, 102, 103, 105, 106, 208, 210, 212, 214, 216, 219, 224, 227, 232–234, 236, 237 Boesch C. 2, 10, 14, 27 Bohning D. E. 176 Bonda E. 38, 42 Bonder L. J. 96, 103 Bonvillian J. D. 83 Borzellino G. 49, 62, 72, 82 Bothorel A. 210, 234 Botvinick M. 174 Boule M. 88, 89, 103 Bowerman M. 244, 255, 256
Boysen S. T. 60, 62 Bradley N. S. 112, 130 Bradshaw J. L. 48, 62, 269, 274, 279 Braem P. B. 128, 129, 130 Brammer M. J. 43 Braun A. R. 63 Braun K. 168, 174 Breazeal C. 234 Bredenkamp D. 74, 85 Briggs R. W. 175 Brinck I. 75, 76, 83 Broadfield D. C. 63 Brooks D. J. 43 Brooks R. A. 208, 234 Brotherton P. N. M. 27 Browd S. R. 175 Brown C. M. 155, 243, 256 Brown G. R. 160, 176, Brown J. W. 169, 170, 174, Brown R.F. 273, 279 Brunner R. J. 268, 279 Buccino G. 40, 42 Burnham D. 234, 237 Burr D. 88, 103 Bush G. 164, 165, 173, 174, 177 Butcher C. 180, 183–185, 189, 191, 202, 204, 205 Butterworth G. 7, 10, 67, 72, 73, 75, 78, 81–83 Byrd D. 113, 129 Byrne R. W. 31, 45, 86, 160, 174 C Call J. 18, 28, 45, 71, 75–77, 83, 86, 117, 161, 177, 263, 282 Calvert G. A. 40, 43 Camaioni L. 58, 65, 67, 82 Camarda R. 43, 45 Campbell R. 40, 43, 213, 234, 236, 237 Canessa N. 42 Cantalupo C. 8, 10, 40, 43, 50, 55, 61, 63, 64 Cantero M. 54, 60, 64, 71, 84 Capdevilla A 177
Capirci O. 180–186, 188, 189, 190, 191, 195, 196, 199, 201, 203–206 Caplan B. 155 Caplan D. 257, 275, 279 Caplan R. 171, 174, Capobianco M. 125, 131, 206 Card J. 171, 176 Cardinal R. 165, 174 Cardoner N 177 Carey S. 244, 256 Carlisle R. C. 88, 103 Carpenter M. 45, 67, 71, 75, 83, 85, 86 Carter C. 173, 174, 177 Caselli M. C. 181, 183, 204–206 Cassandra A. R. 218, 235 Castle J. 74, 85 Catford J. C. 97, 103 Cathiard M. A. 10, 11, 237 Chadwick P. 27 Chaline J. 88, 103 Chang J. J. 102 Charman T. 82 Charrier I. 60, 63 Chase R. A. 236 Chater N. 256 Cheboi K. 81, 86 Cheney D. L. 2, 4, 10, 11, 14, 27–30, 43, 53, 65, 70, 83 Cheverud J. 63 Childers S. R. 177 Chistovich L. A. 212, 234 Choi S. 257 Chollet F. 38, 43 Chomsky N. 1, 10, 30, 43, 44, 107, 161, 175, 240, 249, 256, 257, 262, 280 Christiansen M. 10, 11, 256 Chugani H. 174 Chung E. V. 272, 279 Clark M. J. 82, 103, 162, 174, 180, 204, 255, 256, 273, 279 Clutton-Brock T. H. 15, 16, 27 Code C. 145, 154, 261, 263, 264, 266–270, 272, 275, 276, 279, 280, 282
Cohen A. H. 148, 154 Cohen D. J. 268, 281 Cohen J. 175 Cohen R. A. 170, 175 Condillac B. de 59, 63 Conroy G. D. 63 Contaldo A. 180, 204 Coppens Y. 81, 86 Corazza S. 63 Corballis M. C. 29–31, 43, 48, 50, 60, 61, 63, 64, 81–83, 125, 128, 129, 154, 161, 162, 168, 175, 263, 269, 271, 275, 276, 280 Corbetta M. 10 Corina D. P. 49, 63, 121, 129 Corkum V. 75, 80, 85 Corsi C. 186, 204 Coulmas F. 126, 129 Cox A. 82 Crain S. 240, 256 Crelin E. S. 87, 88, 91, 103, 104 Crockford C. 2, 10, 14, 27 Croft W. 241, 253, 256 Croot K. 276, 280 Crosson B. 165, 175, 281 Crothers J. 88, 103 Crozier S. 62 Cruse H. 160, 175 Csibra G. 160, 175 Curione G. M. 104 D Dalston E. 11 Damasio A. R. 170, 175 Damico J. S. 266, 282 Danchin I. 69, 83 Darwin C. 134, 154, 271, 280 David A. S. 43, 67, 86, 127 Davis B. L. 10, 105, 109, 113, 114, 116–118, 125, 127, 128, 130, 133, 134, 139, 142–145, 148, 154–157, 210, 219, 232–234, 236, 237, 264, 269, 281 Davis D. R. 67, 71, 85 Dawe K. 279 De Langen E. G. 265, 279
De Waal F. B. M. 43, 79 DeBleser R. 276, 280 DeDe G. 49, 64 Dedieu E. 233 Dehaene S. 173, 176 Delgutte B. 213, 234 Demuth K 243, 257 Dennett D. C. 160, 175 DePaul R. 154 Deppe M. 64 Deprez V. 257 Dépy D. 50, 66 Deus J 177 Devescovi A. 125, 131, 206, 255 Devinsky O. 164, 170, 175 Di Bitetti M. S. 14, 26, 27 Diard J. 235 Diessel H. 9 DiGirolamo G. J. 173, 177 DiPiero V. 43 Dittmann J. 265, 279 Dittus W. 53, 63 Dodane C. 255 Dodd B. 213, 234, 236, 237 Dogil G. 169, 175 Dolan R. J. 43 Dominey P. F. 239, 241, 249, 253-258 Donald M. 118, 123, 129, 135, 137, 151, 154 Dowd D 129 Doyle P. J. 277, 281 Dräger B. 64 Drew A. 82 Dronkers N. 262, 280 Dubno J. R. 176 Ducey V. 10 Duder C. 11 Duffy J. R. 270, 275, 280 Dunbar R. I. M. 138, 154 Dunphy-Lelii S. 85 E Edwards M. 279 Eguchi S 97, 103 Engberg-Pedersen E. 5, 10
English and Romanian Adoptees (ERA) 85 Erting C. 180, 182, 185, 204, 205 Erwin J. M. 173, 176 Escudier P. 236, 237 Ettlinger G. 79, 83 Evans A. C. 42, 44, 45 Evans C. S. 14, 26, 28 Evans L. 14, 26, 28 Everitt B. 174 F Fadiga L. 42, 43, 45, 63, 131, 156 Fagot J. 50, 66 Falk D. 51, 63, 88, 103, 125, 129, 130, 142, 154, 170, 175 Fan J. 173, 177 Fant G. 91, 92, 97, 99, 103, 216, 234, 236 Farnell B. 127, 129 Feinberg T 129 Feldman J. A. 240, 256, 263, 275, 280 Feldman M. W. 160, 175 Ferguson C. A. 142, 154 Fern A. 256 Ferrand L. 88, 105 Ferrari P. F. 29, 34, 36, 41, 43, 45, 111, 113, 122, 123, 130, 149, 150, 154 Fields H. 176 Fillmore C. J. 6, 10, 183, 205 Fink G. R. 42 Fischer J. M. 14, 28, 44 Fisher C. 244, 256 Fisher S. E. 276, 281 Fitch R. H. 60, 63 Fitch W. T. 1, 10, 44, 94, 103, 161, 175, 257, 262, 280 Fleet W. S. 157 Fletcher L. 13, 28, 131, 235 Fogassi L. 29, 33, 42–45, 63, 64, 111, 113, 122–124, 127, 130, 131, 145, 150, 154–156 Fontaine A. 62
Forness W. H. 83 Fornito A. 177 Fossett T. R. D. 277, 281 Fouts R. S. 71, 84 Fox N. 171, 176 Fox P. T. 44 Frackowiak R. S. 43 Fragaszy D. M. 69, 82, 83 Franck N. 257 Franco F. 67, 72, 75, 82, 83 Franklin S. 275, 282 Frazier J. A. 174 Freeman A. J. 28, 175 Freund H. J. 42 Frey S. 42, 217 Friederici A. D. 243, 256 Frith C. D. 170, 171, 175, 269, 281 Frith U. 170, 171, 175 G Gabbott P. L. A. 167, 175 Gabioud B. 210, 234 Gaffan D. 175 Gallagher H. L. 128, 170, 171, 175 Gallese V. 33, 42–45, 56, 63, 64, 130, 131, 154-156 Gannon P. J. 50, 63 Gardner B. T. 73, 83 Gardner R. A. 73, 83 Garrard P. 276, 280 Gaynor D. 27 Gehring W. 175 Gentilucci M. 33, 43, 45, 149, 154 Gentner D. 257 George M. S. 7, 82, 176 Georgieff N. 257 Getty L. A. 103 Ghazanfar A. A. 29, 30, 43, 52, 63, 66, 121, 130 Giambrone S. 69, 85 Gibson K. R. 64, 78, 80–83, 85 Giedd J. 94, 103 Gildersleeve-Neumann C. 139, 154
Gilissen E. 173, 176 Gillberg C. 174 Gillette J. 247, 257 Giraldeau L.–A. 69, 83 Givan R. 256 Givón T. 7, 10 Gleitman H. 257 Gleitman L. 257 Glenberg A. M. 263, 275, 280 Glotin H. 233 Gökçay D. 175 Goldberg A. 240, 241, 244, 253, 255, 257 Goldberg G. 145, 154 Goldin-Meadow S. 42, 43, 72, 84, 125, 130, 146, 154, 180, 183–185, 189, 191, 202, 204, 205 Goldman H. I. 142, 155 Goldstein U. G. 94, 102, 103, 211, 234 Golembiowski M. 257 Golinkoff R. M. 70, 78, 83, 84 Gomes A. 171, 176 Gómez J. C. 80, 83 Gommery D. 81, 86 Gonzales-Rothi L. 157 Goodall J. 30, 43, 53, 63, 77, 86 Goodglass H. 276, 280 Goodman MB 130 Goodwyn S. 205 Gould S. J. 143, 144, 155 Goulet P. 268, 281 Gouzoules H. 53, 63 Gouzoules S. 53, 63 Grafton S. T. 38, 43 Greenfield P. M. 257, 263, 275, 280 Grodd W. 175 Grodzinsky Y. 256, 275, 280 Groothues C. 74, 85 Grossi G. 49, 63 Grover L. 67, 83 Guasti M. T. 276, 280 Guenther F. H. 209, 234 Guerin B. 103
Guiard-Marigny T. 211, 235 Guidetti M. 182, 189, 191, 205 Guthrie D. 174 Guyot E. 257 H Hacia J. 167, 175 Hadlang K. A. 175 Hagoort P. 155, 256 Haider H. 175 Hakeem J. M. 171, 173 Hall J. 174, 236 Halliday G. M. 276, 280 Hamner M. B. 176 Hannequin D. 268, 281 Harasty J. 276, 280 Hardcastle W. J. 105, 214, 235, 236, 281 Harding C. G. 78, 84 Hare B. 28 Hari R. 56, 64 Harnad S. 59, 65 Hauser M. D. 1, 10, 14, 26, 28–30, 43, 44, 51-53, 59, 60, 63, 66, 161, 175, 254, 257, 262, 272, 280 Haxby J. V. 263, 281 Hayashi M. 166, 175 Heffner H. E. 51, 63 Heffner R. S. 51, 63 Heilman K. M. 128, 157, 175, 268, 282 Heim J.–L. 87, 89, 103 Heimbuch R. C. 104 Henderson L. M. 171, 176 Henning A. 75, 85 Henningsen H. 64 Hepper P. G. 49, 63, 64 Herschkowitz N. 273, 280 Hertwig O. 144, 155 Hess J. 79, 84 Hewes G. W. 47, 64, 147, 155 Hill E. M. 84, 162, 174 Hillenbrand J. 97, 103 Hirsh I. J. 97, 103 Hobson R. P. 74, 80, 84
Hockett C. F. 147, 155 Hodges J. R. 276, 280 Hoen M. 254, 256, 257 Hof P. R. 167, 173, 176 Hoffman E. A. 105 Höhle B. 243, 257 Holloway R. L. 63 Holowka S. 49, 64, 65 Honda K. 87, 93, 94, 103, 104 Hoole P. 214, 235 Hooper J. B. 134, 143, 154 Hopkins B. 49, 65 Hopkins W. D. 8, 10, 40, 43, 50, 54–56, 60, 64, 65, 67, 69, 70, 71, 73, 75, 82, 84 Horwitz A. R. 176 Hostetter A. B. 71, 81, 84 Houghton P. 88, 104 Howard D. 275, 282 Huber J. E. 97, 104 Hughlings Jackson J. 261, 265, 280 Hurford J. H. 8, 10, 64, 111, 120, 129, 130, 235, 237, 279, 282 I Indefrey P. 146, 155 Inoue-Nakamura N. 84 Inui T. 254, 256 Iriki A. 79, 84 Ishibashi H. 79, 84 Isserlin M. 276, 280 Itakura S. 79, 84 Ito M. 166, 175 Iverson J. M. 49, 64, 72, 82, 84, 125, 130, 181, 183, 186, 188, 191, 204, 205 J Jackendoff R. 1, 6, 10, 11, 137, 155, 161, 176, 241, 244, 249, 254, 255, 257, 272, 280 Jackson H. J. 265, 268, 280 Jacob F. 104, 105, 143, 144, 155 Jahanshari M. 281 Jakielski K. J. 148, 155
Jakobson R. 143, 155, 209, 235, 264, 281 James W. 273, 281 Jaques J. 49, 64 Javanovic J. 177 Jay T. 270, 275, 281 Jays P. R. L. 175 Jenike M. A. 174 Jenkins M. A. 175 Joanette Y. 261, 268, 279, 281 Johansen J. 165, 176 Johnson K. 104 Johnson M. H. 172, 176, 281 Jordan M. 217, 235 Jouventin P. 60, 63 Jürgens U. 30, 44, 117, 130, 168, 176, 177 Jusczyk P. W. 114, 130, 155 K Kaelbling L. P. 218, 235 Kamp H. 175 Kandel S. 105, 236 Kansky R. 27 Kaplan R. F. 175 Kaschak M. P. 263, 275, 280 Kasuya H. 94, 104, 106 Kawato M. 131 Kay J. 279 Keaveney L. 85 Keeble S. 257 Kellogg L. A. 76, 84 Kellogg W. N. 76, 84 Kendon A. 48, 60, 64, 127, 129, 130, 180, 205 Kenstowicz M. 140, 155 Kent R. D. 209, 214, 235, 281 Keysers C. 44, 45, 64, 130, 131, 155 Kido K. 104 Kimura D. 31, 44, 48, 49, 55, 64, 263, 281 King B. J. 75, 86 Kinney A. 156 Klatt D. 104 Klatt S. 104
Kluender R. 243, 257 Knecht S. 61, 64 Knight C. 59, 64, 237, 279, 282 Knight R. 175 Knorr A. M. 268, 281 Kohler E. 34, 44, 45, 56, 64, 122, 130, 131, 150, 155 Koopmans-Van Beinum F. 210, 235 Kornhuber H. H. 268, 279 Koski L. 167, 176 Kotovsky L. 244, 257 Krause M. A. 71, 82, 84 Kreppner J. 85 Kril J. 276, 280 Kuhl P. 104, 209, 219, 225, 228, 231, 235 Kumashiro M. 79, 84 Kutas M. 243, 257 L Laboissière R. 208, 233, 235, 278 Ladefoged P. 113, 130 Lagravinese G. 42 Laitman J. T. 88, 104 Lakoff G. 256 Laland K. N. 160, 175, 176 Lalevée C. 10 Lallouache T. 236 Lamendella J. T. 268, 281 Landgren J. L. 214, 235 Lane H. 137, 146, 157 Langacker R. 240, 257 Launey M. 6, 10 Lauritzen 217, 235 Leach E. 270, 271, 281 Leakey R. 88, 104 Leavens D. A. 53, 54, 64, 67, 68, 70, 71, 73, 75, 77, 78, 81, 84, 86 Lebeltel O. 216, 233, 235 Leckman J. F. 268, 281 Lecours A.–R. 261, 262, 272, 278, 279, 282 Lederer A. 257
Lee S. 97, 104 Legerstee M. 73, 85 Lelekov T. 254, 256, 257 LeMay M. 88, 104 Lenneberg E. 276, 281 Leonard C. M. 175 Leslie A. M. 2, 257 Levelt W. J. M. 139, 141, 146, 155 Levinson S. C. 244, 256 Lewin R. 88, 104 Liebal K. 71, 85 Liebal S. 71, 85 Lieberman D. E. 88, 105 Lieberman P. 87, 88, 91, 103, 104, 133, 155 Lieven E. 9 Liljencrants J. 97, 104, 232, 235 Lillo-Martin D. 240, 256 Lindblom B. 97, 104, 214, 232, 235, 236 Liszkowski U. 75, 85 Littman M. L. 218, 235 Lock A. J. 44, 65, 81, 83, 85, 144, 155, 180, 205 Locke J. L. 49, 64 Loevenbruck H. 8, 11 Logothetis N. K. 62, 64 Lohmann H. 64 Lopez A. 177 Lorberbaum J. P. 168, 176 Lord C. 85 Lubker J. 235 Lui F. 42 Lumley H. 88, 104 Lund J. P. 146, 155 Luppino G. 32, 43–45 Luppino M. 154 Luu P. 164, 165, 173, 174 Luzzatti C. 276, 280 Lydiard R. B. 176 Lyons J. 1, 180, 188, 205 M MacCartney G. R. 49, 64 MacColl A. D. C. 27 MacDonald J. 213, 236
MacDonald M. 174 Mackenzie Beck J. 212, 235, 281 MacLean P. D. 44, 162, 168, 176, 268, 281 MacNeilage P. F. 7, 44, 109, 113, 114–118, 125, 127, 128, 130, 133, 134, 139, 141–145, 148, 150, 154–157, 210, 214, 232, 234, 236, 264, 269, 281 MacWhinney B. 253, 255, 258 Maddieson I. 88, 104, 105, 114, 139, 147, 156 Maeda S. 93, 94, 103, 105, 210, 212, 234, 236 Man A. 1, 62, 64, 88, 89, 93, 102, 105, 255 Mandler B. H. 244, 257 Manning M. B. 165, 176 Manser M. B. 4, 5, 11, 13–18, 26–28 Marchman G. F. 78, 82 Marcus M. 276, 281 Marjanovic P. M. 234 Marler L. 14, 28, 53, 63, 65 Maron A. 175 Martin A. 263, 275, 281 Martin C. C. 44 Martin R. E. 235 Masure M.–C. 62 Matelli M. 43, 45 Matelli R. 154 Mathevon N. 60, 63 Mathews M. V. 102 Matsuzawa T. 84 Matyear C. L. 105, 139, 154, 156, 219, 233, 236, 237 Maximos Planudes 1 Mayberry R. 49, 64, 182 Mayer J. 175 Mazer E. 233, 235 McGuire P. 43 McAllister R. 235 McCarthy R. C. 88, 104, 105 McDonald K. 58, 65 McDonough L. 257 McDuffie A. 176 McGurk H. 213, 236
McIlrath G. 27 McInerney S. C. 174, 177 McMinn-Larson L. 64 McNally R. J. 177 McNeill D. 42, 44, 125, 128, 130, 146, 154, 180, 181, 203–205 McNew S. 255 McShane J. 136, 143, 156 Medicus G. 143, 156 Medina J. 257 Mein P. 81, 86 Mekhnacha K. 233 Meltzoff A. N. 102, 104, 105, 209, 219, 225, 228, 231, 235, 236, 244, 257 Ménard L. 93, 94, 97, 105, 212, 236 Menzel C. R. 71, 85 Messa C. 174 Mesulam M. M. 276, 282 Michel G. F. 49, 64 Miikkulainen R. 255, 257 Miles H. L. 73, 83, 85, 86 Miller C. T. 66 Miller S. 60, 63 Mills A. E. 231, 236 Miolo G. 209, 235 Mitchell R. W. 79, 83, 85, 86 Mohr C. M. 175 Moody D. B. 65 Moore C. 75, 80, 82, 85, 86 Moraschini S. 262, 278 Morford M. 183, 184, 185, 205 Morgan J. L. 243, 257, 258 Morgan K. 82 Morrell M. J. 170, 175 Morris D. H. 45, 88, 105, 157 Morrison J. H. 176 Moser D. J. 175 Mundy P. 170–172, 174, 176 Murata A. 79, 84 Murdock G. P. 143, 156 N Nagell K. 45, 67, 71, 83, 86 Nakagawa A. 172, 174
Narayanan S. 104, 256, 263, 275, 280 Nespoulous J-L. 272, 282 Newman J. D. 44, 168, 169, 174, 176 Newmeyer F. 240, 258 Nightingale N. 82 Nimchinsky E. A. 165, 166, 169, 173, 176 Nishitani N. 56, 64 Noll D. 174 Novak M. A. 79, 84 O O’Rourke P. 49, 62, 72, 82 Oelschlaeger M. L. 266, 282 Olguin K. 71, 86 Olsson K. A. 214, 235 Omohundro S. M. 258 Orliaguet J. P. 233, 236 Ostry D. 65 Osu R. 131 Oztop E. 112, 130 P Pantelis C. 177 Parker S. T. 78, 80–83, 85, 86 Parkinson J. B. 174 Patterson F. G. P. 83 Patterson K. 280 Paus T. 164, 165, 167, 176, 177 Penfield J. 146, 151, 156 Pentland W. 245, 258 Perl D. P. 173, 176, 272, 276, 282 Perlman Lorch M. 272, 276, 282 Perrier P. 103, 234, 236 Peters C. R. 83, 85, 144, 155, 281 Petersen M. R. 51, 65 Peterson G. E. 91, 97, 101, 105, 255 Petitto L. A. 42, 44, 49, 64, 65 Petrides M. 38, 42, 44 Piaget J. 273, 282 Pickard N. 164, 176 Pickford M. 81, 86
Pika S. 71, 85 Pinker S. 1, 11, 147, 156, 240, 258, 263, 282 Piquemal M. 212, 236 Pizzuto E. 125, 127, 128, 130, 131, 181–183, 186, 190, 196, 203–206 Place U. T. 58, 65, 155 Plooij F. X. 53, 65 Poeck K. 266, 280 Poeggel G. 168, 174 Poizner H. 129 Pols L. C. W. 212, 236 Porro C. A. 42 Posner M. I. 164–166, 169, 173, 174, 176, 177 Potamianos A. 104 Potì P. 78, 85 Povinelli D. J. 67, 69, 71, 73, 75, 78, 79, 84, 85 Pratt S. R. 277, 281 Pujol J. 165, 177 Pulvermüller F. 243, 258 R Radinsky L. B. 135, 156 Ramus F. 254, 256 Rapoport S. 159, 177 Rasa O. A. E. 14, 27 Rasmussen A. M. 268, 281 Rauch S. L. 174, 177 Reaux J. E. 73, 85 Recasens D. 215, 216, 236 Redford M. A. 139, 156 Redican W. K. 138, 156 Regier T. 256 Reichholf J. H. 88, 105 Riecker A. 175 Riede T. 62, 65 Riffkin J. 177 Ringelstein E. B. 64 Rizzolatti G. 29, 32–34, 40–45, 63–65, 110, 111, 129–131, 134, 145, 149, 150, 154–156, 177 Robert-Ribes J. 237 Roberts L. 146, 156
Robin D. A. 277, 281 Rochat P. 49, 65 Rogers L. 60, 65, 66 Rohlf F. J. 28 Rolfe L. 70, 85 Rönnqvist L. 49, 65 Rootes T. P. 236 Rosen B. R. 174 Ross L. 174 Rothbart M. K. 166, 176 Rousset I. 139, 141, 142, 156 Roy D. 245, 258 Rozzi S. 43 Rubens A. B. 151, 156 Rumbaugh D. M. 58, 65, 73, 82 Rushworth M. F. S. 175 Russell J. L. 71, 84 Rutter M. 74, 79, 85
S Sabater-Pi J. 74, 86 Sadek J. R. 175 Salloway S. 175 Saltzman E. 113, 129 Samson Y. 62 Sankoff G. 1 Sato M. 8, 10, 11 Savage-Rumbaugh E. S. 58, 65, 73, 78, 85, 86, 115, 131 Savariaux C. 224, 236 Scassellati B. 234 Schaal S. 115, 131 Schepartz L. A. 88, 105 Schmidt R. A. 277, 281 Schroeder M. R. 216, 236 Schwartz J.–L. 1, 10, 11, 14, 27, 60, 65, 88, 97, 102, 103, 105, 207–209, 212–214, 232–234, 236, 237 Seemuller E. 268, 279 Segebarth C. 11, 257 Segui J. 88, 105 Seidman L. J. 174 Seitz R. J. 42 Semenza C. 63 Senut B. 81, 86 Sergio L. E. 65 Serkhane J. E. 93, 105, 207, 210, 237 Seyfarth R. M. 4, 10, 11, 14, 27, 28, 29, 30, 43, 53, 58, 65, 70, 83, 128 Shafer D. D. 54, 65 Shahidullah S. 49, 63 Shankar S. G. 73, 75, 85, 86, 131 Shannon I. A. 49, 64 Shapiro J. 28 Shi R. 243, 257, 258 Shimizu K. 166, 175 Shipman P. 88, 105 Shore C. 67, 82 Shreeve J. 105 Shulman G. L. 10 Siegel M. I. 88, 103 Sigman M. 174 Simon P. 234 Simonyan K. 169, 177 Sim-Selley L. J. 168, 177 Siskind J. M. 245, 253, 256, 258 Skinner B. F. 58, 65 Skinner J. D. 27 Slobin D. I. 180, 205, 206 Slobodchikoff C. N. 14, 28 Smith L. 171, 177 Smith S. 255 Smith-Rohreberg D. 52, 63 Snyder A. Z. 10 Sock R. 233 Sokal R. R. 18, 28 Speedie L. J. 268, 282 Spiegelhalter D. 235 Spinozzi G. 78, 85 Stanley C. M. 10 Stathopoulos E. T. 104 Stebbins W. C. 65 Steels L. 232, 237, 245 Stefanuto M. 278 Steketee J. D. 168, 177 Steklis H. D. 59, 65 Stenger V. A. 174 Sternad D. 131
Stokoe W. C. 42, 48, 62, 109, 120, 131, 146, 157, 276, 278 Stolcke A. 256, 258 Story B. H. 94, 105 Striano T. 75, 85 Strick P. L. 164, 176 Studdert-Kennedy M. G. 64, 82, 137, 146, 157, 237, 279, 282 Suckling J. 43 Sufit R. L. 235 Sugarman S. 70, 78, 86 Suger G. 269, 279 Sukigara M. 172, 174 Surguladze S. 43 Sussman H. 8, 11 Sutton-Spence R. 128–130 Suzuki H. 104 Swettenham J. 82 Swick D. 165, 177 T Takemoto H. 102, 105 Tallal P. 60, 63, 65 Talmy L. 240, 244, 258 Tanner J. E. 31, 45, 86 Taylor T. J. 73, 85 Taylor T. T. 131 Thal D. 78, 82, 186, 205 Theall L. A. 73, 85 Thelen E. 49, 50, 64 Thivard L. 62 Thomas R. K. 69, 82, 84, 234, 268 Thompson N. S. 44, 76, 82, 86 Thompson V. E. Thrun S. 217, 237 Tiede M. K. 87, 93, 94, 104 Titze I. R. 105 Tobias S. 186, 205 Tomasello M. 2, 3, 7, 9, 11, 14, 28, 31, 45, 58, 65, 67, 71, 73, 75–77, 81, 83, 85, 86, 109, 131, 135, 157, 161, 177, 206, 240, 241, 244, 253, 255, 257, 258, 263, 282 Torello M. W. 60, 62 Traugott E. 1
Traversay J. 174 Tree J. 279 Trinkaus E. 88, 105 Tucker D. M. 170, 177 Tukey J. W. 102 Tutin C. E. G. 69, 86 U Uchiyama Y. 79, 84 Ulvund S. 177 Umiltà M. A. 33, 44, 45, 64, 130, 155 Ungerleider L. G. 263, 281 V Vallée N. 88, 105, 234, 236, 237 Vallejo J. 177 Valone T. J. 83 Van der Merwe A. 277, 282 Van Der Stelt J. 210, 235 Van Essen D. C. 10 Van Hoesen G. 170, 175 Van Hooff J. A. R. A. M. 31, 38, 45, 138, 157 Van Lancker-Sidtis D. 263–265, 270, 282 Vanier M. 262, 278 Vannier M. W. 63 Varley R. 273, 282 Vauclair J. 47, 50, 53, 57, 58, 65, 66, 88, 105 Veá J. J. 74, 86 Velakoulis D. 177 Vilain A. 1, 10, 12, 210, 237, 278 Virbel J. 282 Vogt B. A. 44, 164, 170, 175, 176 Vogt L. J. 177 Volterra V. 63, 67, 82, 180–183, 186, 190, 191, 204–206 W Wagner R. H. 69, 83 Walker S. 271, 274, 282 Wallesch C.–W. 261, 265, 269, 279 Wallman J. 47, 66 Wambaugh J. 277, 281
Want S. 273, 282 Warner T. A. 175 Warren J. M. 45, 50, 66 Watson K. 171, 173 Watson R. T. 151, 157 Weber S. H. 256 Webster J. 275, 282 Wein D. 64 Weiss D. 52, 66 Weissenborn J. 243, 257 Welch K. 151, 156 Werker J. F. 258 Wertman E. 268, 282 Wesley M. J. 71, 84 Whalen P. J. 66, 174, 177 Wheeler K. 103 White P. 49, 63, 94, 106 Wilcox S. E. 42, 48, 62, 276, 278 Wildgruber D. 175 Wilhelm S. 177 Wilkerson B. J. 86 Wilkins D. 72, 81, 86 Wilkinson H. 175 Williamson M. 234 Wioland F. 234 Wise R. J. 9, 43 Wood S. J. 74, 177, 224, 237 Woods R. P. 44 Woodward A. L. 244, 258 Wrangham R. W. 2, 10, 86 Wray A. 10, 110, 131, 135, 157, 263, 265, 270, 271, 273, 274, 278, 282 Wu Z. L. 212, 237 X Xu F. 244, 256 Y Yale M. E. 176 Yang C.–S. 94, 106 Yoder P. J. 176 Yücel M. 164, 165, 177 Z Zerling J.–P. 234 Zilbovicius M. 62 Zilles C. 163, 177 Zilles K. 42 Zoloth S. R. 65 Zuberbühler K. 28 Zuffante P. 175
Subject index
A ability, 13, 14, 60, 61, 67, 78, 80, 81, 104, 108, 112, 114, 115, 118, 120, 121, 122, 125, 126, 127, 135, 141, 152, 160, 166, 167, 170, 172, 198, 203, 209, 210, 219, 221, 225, 227, 228, 230, 231, 235, 239, 241, 244, 249, 253, 254, 255, 257, 265, 271, 273, 279 aboriginal, 127, 130 abstract, 76, 108, 125, 180, 240, 241, 244, 249, 254, 256 acoustic, 9, 13, 14, 17, 22, 24, 26, 28, 34, 63, 65, 87, 88, 90, 92, 93, 94, 96, 97, 98, 99, 102, 150, 169, 207, 209, 210, 212, 216, 219, 221, 224, 227, 229, 231, 234, 236, 237 acoustically, 150, 221, 225, 233 acoustics, 88, 90, 233 acquisition, 9, 14, 30, 41, 49, 59, 86, 125, 155, 157, 204, 213, 232, 234, 240, 242, 253, 255, 256, 257, 258 act, 26, 67, 70, 72, 74, 79, 80, 83, 136, 161, 170, 171, 184, 216, 232 action, 4, 9, 29, 32, 33, 34, 36, 38, 40, 42, 43, 45, 56, 57, 58, 59, 64, 75, 108, 112, 113, 118, 119, 120, 122, 123, 128, 129, 131, 136, 137, 138, 144, 145, 148, 149, 150, 151, 152, 155, 156, 157, 161, 163, 171, 173, 174, 175, 208, 213, 232, 237, 238, 244, 259, 260, 261, 263, 269, 270, 276, 277, 278, 280, 281, 283
activate, 34 activation, 35, 40, 42, 43, 116, 160, 169, 170, 171, 174 active, 9, 33, 34, 62, 122, 123, 131, 149, 150, 241, 249, 250, 260 active vocal filtering, 62 activity, 22, 44, 49, 50, 59, 64, 74, 80, 113, 115, 120, 149, 171, 172, 265 adapt, 118 adaptation, 67, 80, 115, 118, 131, 135, 147, 152, 153, 276 adaptive, 59, 118, 126, 136, 137, 177, 225, 235 adolescent, 69, 76 adult, 8, 16, 63, 75, 94, 102, 104, 105, 108, 118, 121, 137, 141, 166, 180, 181, 182, 188, 189, 196, 203, 208, 209, 211, 223, 224, 225, 237, 240, 258, 267, 279 adulthood, 87, 93, 96, 105, 207, 208, 211, 224, 236, 243 affect, 76, 123, 168, 170, 171, 172 affective, 80, 159, 164, 165, 167, 176, 177, 267 affect-laden, 76 affiliative, 35, 38, 41 Africa, 15 African, 114 agent, 41, 54, 80, 113, 239, 244, 259, 260 agrammatic, 256, 257, 272, 273, 277, 282 agrammatism, 261, 262, 264, 267, 270, 272, 275, 276, 277, 278, 280 agranular, 163
alarm, 1, 3, 4, 6, 9, 11, 13, 14, 15, 16, 17, 18, 19, 22, 24, 26, 28, 30, 52, 58, 62, 65 allometric, 162 Alzheimer, 166, 280 amygdala, 165, 174 analogy, 85, 234, 256 anatomical, 8, 31, 40, 56, 71, 72, 87, 88, 89, 90, 101, 102 anatomy, 43, 102, 103, 104 ancestor, 60, 67, 107, 112, 116, 122, 150 ancestral, 4, 112, 122, 143, 157 ancient, 61, 116, 144, 162, 268, 275 anecdotal, 147, 267 anecdote, 5 anterior cingulate cortex, 128, 129, 159, 160, 161, 162, 163, 164, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177 anthropology, 106 anthropomorphic, 87, 94 apes, 6, 9, 31, 40, 41, 45, 48, 50, 53, 54, 56, 58, 60, 61, 65, 67, 69, 71, 72, 73, 74, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 110, 157, 175, 263 aphasia, 9, 154, 235, 256, 262, 264, 265, 270, 272, 275, 277, 278, 279, 280, 281, 282, 283 aphasic lexical speech, 261, 262 aphasic speech, 261, 265, 266, 268 aphasic speech automatism, 261, 265, 266, 268 aphasiology, 111, 279
Subject index
apparatus, 36, 123, 124, 134, 144, 209 approaching predator, 14, 15, 19, 24, 26 apraxia, 261, 262, 270, 275, 276, 277, 279, 281 architecture, 70, 159, 163, 212, 242, 246, 249, 280 area function, 90, 91, 92, 93, 97, 99, 105, 211 argument, 10, 112, 241, 244, 249, 251, 252, 253, 254 arithmetic, 265 arm, 6, 7, 9, 49, 116, 119, 131, 188 arm-raise, 6 arousal, 161 arouse, 271 array, 165, 246, 248, 252 articulation, 97, 105, 113, 114, 148, 203, 235, 236, 262 articulatori-acoustic, 209, 231 articulatory, 8, 11, 87, 88, 93, 94, 95, 96, 97, 98, 99, 102, 103, 105, 126, 154, 207, 209, 210, 211, 212, 214, 215, 218, 219, 221, 224, 226, 227, 229, 230, 231, 232, 234, 236, 237, 275 artificial, 238 artificial intelligence, 238 arytenoid, 93, 94 Asian, 128 assemblage, 116 associate(d), 13, 15, 26, 27, 28, 38, 40, 41, 51, 52, 54, 56, 61, 107, 119, 123, 138, 142, 149, 154, 170, 175, 183, 240, 241, 245, 246, 247, 249, 265, 272, 274, 275, 276 association, 40, 41, 49, 56, 60, 61, 159, 174, 245, 247 asymmetry, 40, 47, 48, 49, 50, 51, 52, 56, 59, 60, 61, 62, 63, 66, 165, 182 asynchronous, 191, 200, 201, 202
attention, 17, 31, 53, 57, 67, 70, 71, 72, 73, 74, 79, 80, 81, 82, 83, 85, 86, 130, 159, 163, 166, 171, 172, 174, 176, 177, 180, 186, 188, 189, 240, 245, 261 attentional, 8, 75, 84, 85, 160, 166, 170, 171 attribution, 161 audiovisual, 34, 38, 213, 225, 228, 229, 231, 234, 237 auditory, 1, 11, 28, 51, 52, 53, 56, 59, 62, 118, 121, 123, 128, 133, 136, 138, 147, 150, 152, 153, 207, 210, 212, 214, 216, 226, 227, 228, 229, 231, 233, 234, 236, 237, 243 Australia, 130 Australian, 127 australopithecines, 142 autism, 82, 84, 171, 174, 176, 177 automatic, 128, 168, 172, 173, 176, 253, 263, 264, 265, 268, 282 automatically, 265 automatism, 154, 266, 274, 279 aux [auxiliary], 88, 89, 103, 105, 261, 270, 272, 278 auxiliary, 267, 275 avian, 3, 4, 28 avoidance, 26 aware, 266, 267, 273 awareness, 4, 75, 83, 124, 273 B babbling, 1, 7, 49, 64, 133, 139, 140, 141, 142, 144, 154, 155, 156, 207, 208, 210, 219, 224, 225, 234, 236, 264, 267 baby, 9, 10, 44, 49, 64, 65, 70, 74, 75, 83, 90, 101, 105, 142, 143, 207, 209, 210, 211, 218, 232, 237, 268 baby robot, 105, 207, 209, 210, 232, 237 baby talk, 143 back, 90, 139, 140, 220, 224, 225
backbone, 89 bad language, 271 barriers, 77, 79 basal ganglia, 151, 268, 269, 274, 282 basion, 89 behavior, 31, 32, 36, 38, 43, 45, 53, 54, 55, 58, 67, 69, 71, 73, 75, 76, 77, 80, 82, 84, 86, 135, 156, 162, 170, 171, 175, 176, 182, 208, 210, 215, 218, 226, 227, 229, 235, 252, 257, 280 bias, 1, 8, 51, 52, 54, 61, 215 bidirectional, 84 bilateral, 116, 170, 175 bimanual, 55, 61, 64 bimodal, 11, 84, 186, 189, 191, 194, 195, 200, 235, 237 binding, 177, 252 biomechanical, 133, 139, 140, 141, 144, 145, 148, 153 bipedal, 77, 81 biphasic, 135, 143, 148 bird, 15, 19, 57, 119, 120, 190 birth, 7, 84, 87, 93, 96, 102, 105, 142, 207, 209, 211, 231, 236 blind, 84, 236 body, 10, 17, 50, 94, 107, 109, 119, 134, 139, 151, 152, 162, 165, 183, 188, 214, 249, 280 bolthole, 15, 18, 22 bonding, 74, 136, 138, 159, 168 brachio-manual, 31, 40, 42, 124 brain, 10, 32, 38, 41, 42, 43, 47, 48, 49, 50, 56, 59, 60, 61, 62, 63, 64, 83, 107, 108, 109, 110, 112, 113, 115, 118, 120, 121, 124, 128, 129, 131, 135, 146, 156, 159, 160, 162, 165, 167, 168, 169, 170, 173, 174, 175, 176, 177, 232, 243, 257, 261, 262, 269, 275, 276, 278, 279, 280, 281 breeding, 5, 15 Buang, 1, 6
C calcium-binding, 167, 176 call, 1, 3, 4, 5, 8, 13, 14, 15, 16, 17, 18, 19, 22, 24, 26, 28, 51, 53, 58, 108, 116, 161, 168 calretinin, 167, 176 canonical, 134, 138, 147, 192, 207, 210, 219, 224, 225, 254 Cantonese, 272, 279 capacity, 29, 32, 33, 34, 40, 41, 42, 48, 67, 79, 108, 111, 118, 124, 135, 137, 139, 143, 147, 151, 153, 157, 176, 229, 231, 243, 244, 255, 263, 268, 275, 280 captive, 45, 54, 65, 68, 69, 74, 76, 77, 78, 79, 80, 81 captivity, 67, 69, 71, 74, 77, 81, 263 capuchin monkey, 85 caregiver, 78, 80, 136 cat, 57, 252, 259 causality, 78, 171, 257 cavity, 87, 88, 90, 91, 93, 98, 99, 101, 102, 104 Cebus apella, 27, 85 Cebus apella nigritus, 27 Cebus capucinus, 14 cell, 165 central, 139, 141, 148, 154, 212, 222 central nervous system, 212 central pattern generator, 148, 154 Cercopithecus aethiops, 14 cerebellar, 169 change, 19, 24, 53, 60, 63, 65, 67, 72, 104, 112, 113, 116, 118, 119, 120, 133, 135, 136, 137, 140, 141, 143, 144, 148, 155, 159, 162, 235, 245 channel, 133, 136, 207, 214, 252 chatter, 124, 149 chewing, 115, 116, 124, 135, 138, 145, 149, 155, 264
child, 6, 9, 57, 58, 65, 84, 106, 108, 109, 115, 118, 125, 172, 182, 185, 186, 189, 192, 194, 196, 198, 201, 205, 232, 235, 236, 240, 243, 254, 264, 273, 282 chimpanzee, 30, 44, 45, 47, 50, 54, 60, 62, 63, 69, 72, 74, 76, 79, 83, 85, 86, 89, 93, 102, 103, 104, 105, 107, 112, 165, 166, 175 Chimpsky, 1 Chinese, 114, 126 Chomskyan, 1 circular, 78, 81 clause, 250 clinical, 129, 156, 206, 235, 273, 283 close, 15, 18, 26, 38, 53, 74, 79, 113, 116, 134, 138, 183, 221, 264 closed class, 121, 124, 242, 243, 244, 247, 252, 253, 256, 257, 259 close-open, 116, 134, 138, 264 closing, 8, 113, 183, 188 closure, 36 clue, 141 coarticulation, 8, 11, 234, 236 cochlear, 157, 212, 213 coda, 113 code, 32, 33, 34, 124, 191, 247 coding, 32, 43, 125, 183, 185, 186, 191, 204, 234, 246, 255 cognition, 10, 64, 65, 66, 78, 79, 83, 85, 86, 159, 160, 162, 163, 173, 174, 175, 176, 205, 256, 258, 262, 277, 278, 282 cognitive, 30, 31, 44, 45, 50, 60, 61, 67, 74, 75, 78, 81, 83, 85, 108, 112, 113, 135, 155, 159, 160, 162, 164, 165, 167, 174, 177, 198, 207, 208, 225, 231, 239, 254, 256, 257, 258, 261, 262, 263, 269, 276, 281, 283
combination, 41, 111, 115, 162, 179, 182, 185, 189, 196, 201, 205, 211, 263, 275 combine, 41, 53, 119, 122, 179, 183, 203, 269 command, 7 communicate, 31, 48, 137, 146, 225, 270 communication, 4, 11, 28, 30, 31, 34, 40, 42, 43, 44, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 67, 70, 73, 74, 78, 80, 81, 82, 83, 84, 85, 86, 102, 107, 109, 110, 112, 116, 117, 118, 120, 121, 123, 124, 125, 126, 127, 128, 129, 130, 133, 134, 136, 137, 138, 142, 146, 149, 150, 152, 154, 159, 160, 161, 162, 167, 169, 171, 172, 174, 175, 176, 177, 181, 188, 194, 203, 204, 205, 208, 209, 232, 234, 235, 236, 238, 261, 262, 263, 267, 268, 269, 273, 274, 278, 281 communicative, 5, 7, 29, 31, 32, 34, 35, 38, 40, 41, 42, 43, 47, 53, 54, 56, 58, 59, 60, 61, 71, 74, 75, 77, 79, 83, 86, 116, 119, 120, 121, 122, 123, 126, 130, 137, 138, 145, 146, 149, 150, 153, 154, 160, 161, 185, 186, 189, 191, 240 comparative, 2, 31, 53, 65, 66, 73, 84, 102, 157, 206, 263 compensation, 96, 222, 276 complementary, 184, 185, 190, 196, 198, 199, 201, 203 complementizer, 1 complex, 27, 30, 35, 41, 58, 59, 74, 107, 108, 112, 114, 118, 121, 123, 125, 127, 135, 139, 142, 143, 147, 151, 157, 162, 163, 171, 208, 232, 248, 249, 250, 268
complexity, 2, 41, 49, 75, 81, 112, 143, 154, 156, 164, 169, 198, 249 component, 59, 60, 87, 116, 135, 136, 139, 145, 152, 160, 161, 162, 172, 176, 189, 211, 241, 248, 249, 250 comprehension, 49, 59, 60, 79, 83, 115, 121, 256, 257, 275 computation, 131, 160 computational, 130, 161, 207, 208, 209, 232, 239, 258 concatenate, 152 concatenation, 274 concept, 78, 161, 232, 251, 273 conceptual, 4, 5, 70, 161, 244, 249, 254, 256 conceptual/semantic, 249 conceptual-intentional, 161 conceptualization, 244, 257 conflict monitoring, 128, 173 conjoined, 252, 260 connection, 135, 136, 138, 169, 218, 266, 267, 268, 275 connectivity, 64, 164, 167, 176 consciousness, 273 consonant, 8, 114, 133, 134, 138, 139, 140, 141, 142, 143, 144, 148, 155, 156, 232, 266 conspecific, 2, 5, 51, 60 constraint, 140, 141, 144, 148, 257 constriction, 90, 92, 93, 97, 99, 216, 224, 237 construction grammar, 239, 241, 253, 255, 256 construction-based model, 239 constructive, 170 contact, 15, 53, 54, 168, 186, 215, 253 content-loaded, 179, 180, 181, 183, 189 context, 10, 14, 15, 16, 19, 21, 22, 26, 27, 50, 51, 53, 54, 55, 59, 60, 70, 72, 74, 75, 76, 79, 80,
81, 138, 143, 180, 183, 186, 190, 204, 216, 263, 265, 267, 271, 272 contextual, 266 contextually, 136, 180 continuity, 1, 5, 73, 135, 147, 170, 199, 240, 241 continuous, 96, 257 contra-lateral, 151 control, 4, 7, 8, 11, 30, 32, 38, 40, 41, 42, 43, 44, 45, 48, 49, 52, 59, 60, 61, 64, 71, 76, 90, 96, 99, 102, 112, 115, 116, 117, 118, 122, 123, 124, 125, 127, 128, 141, 144, 145, 146, 152, 153, 154, 159, 161, 162, 168, 170, 174, 176, 177, 208, 209, 210, 212, 213, 220, 231, 232, 233, 235, 236, 237, 268, 269, 270, 274, 279, 281, 282 convergence, 2, 42, 55 conversation, 170, 282 co-occurrence, 139, 140, 141, 143, 144, 148, 154, 156, 270, 275 cooperative, 15, 27, 131 coordination, 49, 61, 116, 131, 210 coordinative, 80 coproduction, 8 coronal, 139, 140, 141 cortex, 9, 10, 29, 32, 34, 36, 38, 40, 42, 43, 44, 45, 56, 63, 111, 116, 117, 118, 123, 128, 130, 145, 146, 149, 152, 154, 156, 159, 162, 163, 165, 166, 167, 168, 170, 171, 174, 175, 176, 177, 269, 273, 281 cortical, 32, 38, 40, 41, 42, 44, 45, 52, 59, 63, 64, 116, 117, 145, 146, 159, 161, 162, 163, 167, 168, 170, 177, 274, 279 coupling, 208 coverbal, 180 cranial, 104, 105 cranio-facial, 211
cries, 168 crossmodal, 180, 182, 183, 190, 191, 194, 196, 198, 199, 200, 201, 203 cross-sectional, 72, 84, 90, 97, 102 crosstalk, 1, 3, 7, 9 crowned eagle, 62 cry, 168 crying, 268, 281 cue, 7, 26, 243, 253 culture, 10, 78, 81, 83, 85, 86, 104, 109, 131, 147, 160, 175, 205 cursing, 265, 270 cycle, 8, 113, 115, 135, 137, 138, 143, 148, 210, 264 cyclicity, 115, 138, 145, 150, 152 Cynomys gunnisoni, 14 cytoarchitectonic, 164 cytoarchitecture, 163 D Darwinian, 82, 134, 136 dative, 249 deaf, 5, 42, 44, 48, 121, 125, 126, 133, 204, 205, 206 declarative, 7, 57, 58, 73, 74, 75, 76, 79, 80, 171, 172 deficit, 74, 174, 214, 276 degeneration, 276 degenerative, 277, 280 degrees of freedom, 120, 126, 207, 221 deictic, 1, 3, 7, 78, 179, 180, 181, 182, 183, 184, 185, 186, 188, 189, 190, 193, 194, 195, 203 deixis, 1, 3, 6, 8, 9, 11, 67, 69, 70, 80, 180, 251, 257 delay, 16, 18, 19, 22, 24, 201 delayed, 17, 19, 192, 198, 205 delivery, 77, 79 demonstrative, 1, 6, 188 descent, 82, 134, 135, 148, 154 desirable, 77, 79, 80
detection, 128, 173, 177, 207, 212, 237 development, 11, 14, 41, 44, 45, 49, 50, 57, 62, 64, 65, 66, 67, 69, 74, 76, 77, 78, 80, 81, 82, 83, 85, 86, 103, 105, 112, 117, 118, 119, 123, 125, 126, 127, 131, 137, 141, 156, 160, 172, 176, 179, 180, 182, 189, 191, 194, 196, 200, 201, 203, 204, 206, 207, 208, 209, 213, 225, 231, 235, 236, 239, 244, 249, 256, 257, 258, 262, 263, 264, 270, 271, 273, 274, 277, 278, 280 developmental, 2, 8, 50, 67, 72, 73, 78, 82, 85, 104, 109, 114, 115, 143, 171, 177, 180, 181, 182, 183, 184, 186, 190, 192, 194, 198, 199, 201, 202, 204, 206, 208, 209, 219, 224, 232, 235, 237, 254, 256, 276 device, 7, 8, 138, 203 dexterity, 113, 115, 116, 117, 148 Diana monkey, 2, 28, 62, 65, 66 direct, 53, 54, 57, 70, 73, 74, 79, 171, 180, 186, 189, 209, 240 directed, 17, 32, 33, 54, 58, 119, 152, 160, 161, 170, 244 direction, 13, 14, 15, 16, 17, 18, 22, 24, 26, 32, 51, 52, 72, 73, 79, 214 directional, 3, 125 directionality, 2 directly, 54, 77, 79 director, 188 dirty talk, 270, 274 dirty word, 270, 281 discharge, 34, 56, 123, 145, 149 disconnection, 160 discontinuity, 212 discourse, 239, 241, 249, 250 discrete, 7, 13, 14, 26, 115, 131, 148, 226
discrimination, 44, 51, 163, 214, 243, 247 disease, 166, 276, 280 disorder, 174, 268, 276, 281 dispersion, 223, 237 displacement, 70 dissociation, 120, 121, 266, 282 distal, 7, 45 distress, 2 distribution, 63, 97, 193, 207, 217, 218, 227, 229 dominance, 8, 61, 63, 156, 236 dorsal, 8, 139, 140, 163, 164, 165, 167, 171 dorsum, 211, 221, 225, 226 dual-event, 248 dyad, 133, 136 dyadic, 31, 40, 136, 137, 142 dynamic, 64, 75, 253, 281 dynamics, 64, 107, 258 dysfunction, 166, 174, 177, 268 dysgranular, 38 E early language, 7, 125, 131, 179, 185, 189, 190, 203, 206, 242 EEG, 40, 171, 176 elbow, 151 element, 28, 133, 138, 152, 161, 169, 179, 180, 181, 182, 184, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 199, 200, 203, 241, 242, 243, 244, 245, 246, 249, 250, 251, 252, 253, 257, 265, 271, 274, 276 embedded, 2, 216, 257, 278 embodiment, 129, 134, 207, 208 emergence, 3, 11, 12, 29, 40, 47, 48, 61, 62, 64, 65, 81, 86, 110, 112, 113, 118, 120, 123, 125, 126, 127, 130, 156, 160, 166, 167, 172, 205, 206, 232, 238, 262, 263, 270, 271, 272, 273, 274 emergency, 3
emission, 42, 43, 167, 272 emit, 15, 26, 29, 32 emitter, 2 emotion, 59, 66, 83, 117, 163, 165, 173, 175, 265, 268 emotional accompaniment, 271 emotional approach, 176 emotional arousal, 161 emotional aspect, 268 emotional behavior, 30, 38, 123 emotional bonding, 74, 79, 136 emotional consequence, 80 emotional cries, 168 emotional engagement, 80 emotional evaluation, 170 emotional exchange, 80 emotional expression, 166, 275 emotional feeling, 271 emotional force, 270 emotional influence, 173, 174 emotional intonation, 169 emotional involvement, 31 emotional language, 268 emotional responsiveness, 77, 79 emotional self-control, 166 emotional situation, 59 emotional state, 14, 30, 31, 62, 80, 113 emotional surge, 271 emotional valence, 52 emotional vocalization, 168, 170 emphasis, 48, 116, 147, 148, 152, 183, 240, 271 encode, 62, 113, 258 encoding, 34, 136, 173, 203, 244, 279 endogenous, 77, 216 endowed, 31, 32, 34, 36, 38, 40, 41, 57, 214, 231 endowment, 240 English, 1, 6, 47, 85, 103, 110, 114, 115, 126, 147, 157, 225, 243, 254, 256, 259, 275
enhance, 16 enrich, 42, 112, 126 entity, 5, 67, 70, 81 entropy, 217 environment, 14, 15, 26, 63, 69, 70, 81, 120, 160, 180, 188, 216, 242, 273, 274 environmental, 69, 76, 84, 136, 166 epigenetic, 80, 81, 160 epileptic, 171 equipped, 159, 173, 240 ERP, 60, 177 error, 85, 128, 165, 173, 215, 228, 249, 251 eulaminate, 167 European, 1, 11, 42, 43, 62, 125, 128, 130, 154, 204, 233, 238, 255, 272 evaluative, 128, 173, 174 event, 14, 16, 17, 18, 19, 22, 24, 30, 53, 57, 58, 62, 67, 70, 71, 72, 74, 75, 76, 79, 80, 110, 119, 138, 146, 171, 207, 212, 213, 237, 239, 240, 243, 244, 245, 246, 248, 249, 250, 252, 253, 256, 258, 259, 260 evolution, 1, 7, 10, 11, 13, 14, 26, 28, 29, 30, 31, 36, 38, 40, 41, 42, 43, 44, 47, 48, 50, 58, 61, 62, 63, 64, 65, 69, 80, 81, 82, 83, 85, 87, 88, 101, 104, 107, 109, 110, 111, 112, 114, 115, 116, 117, 118, 121, 122, 123, 126, 127, 128, 129, 130, 131, 133, 134, 135, 136, 137, 138, 139, 140, 141, 143, 144, 146, 147, 148, 150, 151, 152, 153, 154, 155, 156, 157, 159, 160, 161, 162, 163, 167, 173, 175, 176, 207, 212, 236, 237, 239, 257, 261, 262, 263, 264, 265, 267, 269, 270, 271, 272, 273, 275, 277, 279, 280, 281, 282, 283 evolution of aphasic lexical speech, 261
evolution of human communication, 261, 262 evolution of language and speech, 261, 283 evolution of speech, 43, 44, 104, 114, 116, 128, 130, 133, 134, 140, 155, 236, 261, 264, 277, 281 evolutionary, 2, 7, 29, 30, 41, 42, 48, 54, 59, 61, 67, 81, 87, 102, 109, 111, 114, 115, 116, 117, 118, 119, 122, 124, 125, 126, 129, 135, 136, 137, 154, 160, 162, 174, 177, 225, 233, 261, 262, 263, 268, 270, 275, 276, 277, 278 evolve, 10, 38, 44, 109, 124, 152, 153, 167, 175, 177, 208, 257, 267, 272, 276, 280 exaptation, 160 exchange, 57, 80, 138, 271 execute, 34, 40 execution, 29, 32, 33, 34, 35, 38, 56, 111, 122, 123, 145, 149, 277 executive, 32, 152, 165, 174, 177 exogenous, 69, 77, 216 expanding spiral, 110, 117, 118, 121, 126, 127 expiration, 113 expiratory, 113 expletive, 270, 271, 275, 278 expression, 5, 14, 31, 40, 52, 69, 121, 123, 125, 129, 162, 167, 190, 249, 251, 266, 268, 270, 276, 280 expressive, 117, 128, 205, 239, 240, 265, 267, 271 extension, 55, 72, 120, 122, 148, 150, 232 external, 5, 14, 26, 30, 145, 151, 152, 216, 229, 231 extra meaning, 239 extraction, 8, 11, 253, 257 extrinsic, 94, 151 eye, 8, 10, 54, 84, 163, 172, 186, 234, 236, 237
F F/C [Frame/Content], 115, 133, 136, 137, 138, 139, 143, 145, 146, 149, 151, 152, 153 face, 36, 41, 43, 52, 71, 74, 76, 77, 120, 126, 128, 155, 169, 209, 213, 216, 225, 263 face gestures, 128 face processing, 155 face-to-face, 76 facial, 31, 40, 45, 52, 63, 66, 110, 112, 121, 124, 125, 126, 127, 128, 150, 167, 183, 231, 276 facial praxis, 276 father, 117, 154, 186 fear, 79, 165, 272, 275 feature, 48, 160, 180, 190, 244, 263, 268, 274 fecal, 4 feces, 4 feed-back, 9, 167 feed-forward, 167 feeding, 38, 51, 55, 64, 85, 110, 124, 135 female, 73, 79, 92, 99, 101, 106, 186, 212 filtering, 62 finger, 71, 72, 76, 85, 86, 180, 183, 188, 189 fire, 109, 119 fissure, 51 fixed, 94, 107, 112, 113, 118, 124, 125, 127, 216, 226, 241, 251, 254 FLB [faculty of language broad], 161 flee, 3 flexibility, 30, 112, 118, 125, 127, 241 flexible, 31, 41, 88, 93 flexion, 105, 148, 151 fMRI, 11, 40, 42, 43, 170, 174, 175, 177, 257 focalization, 237 focus, 5, 8, 11, 107, 109, 113, 118, 146, 161, 162, 167, 182, 192,
202, 208, 216, 229, 240, 241, 249, 250, 254, 257, 261, 262, 281 food, 4, 13, 14, 15, 26, 28, 38, 51, 53, 54, 63, 72, 75, 78, 123, 135, 137, 188 foot, 7, 9, 10 foraging, 3, 4, 6, 18, 24, 142 foramen, 89 forebrain, 168, 177 foreign, 4, 5, 213 formant, 91, 97, 105, 211, 212, 216, 219, 221, 224, 226, 228 FormToMeaning, 245, 246, 247, 248, 251, 252 formulaic, 263, 264, 265, 267, 268, 269, 270, 274, 278 fossil, 9, 104 fossilised, 88, 262, 267, 270 FOXP2, 276, 280, 281 fractionating, 255 fractionation, 110 frame, 7, 17, 38, 44, 113, 114, 115, 116, 130, 133, 137, 138, 143, 145, 148, 151, 152, 155, 156, 191, 232, 236, 269, 281 frame aphasia, 269 Frame/Content theory, 133, 264 framework, 10, 87, 107, 129, 160, 174, 185, 207, 208, 213, 218, 226, 229, 232, 233, 240, 241, 244, 251, 253, 254, 256, 278, 282 framing, 219, 221, 226 free, 28, 44, 62, 65, 77, 82, 86, 125, 147 freeing, 126 free-living, 44, 86 free-ranging, 28, 62, 65, 82 French, 2, 7, 11, 42, 62, 97, 105, 205, 210, 233, 236, 237, 243 frequency, 18, 49, 54, 60, 72, 90, 91, 94, 98, 106, 150, 194, 198, 201, 211, 212, 216, 219, 224, 226, 243, 265, 266, 272
Subject index
front, 90, 93, 99, 139, 140, 141, 214, 224, 225 frontal, 8, 10, 11, 32, 40, 42, 44, 146, 163, 165, 171, 172, 175, 176, 257, 261, 262, 263, 264, 269, 274, 277, 281 fronted, 99, 220, 223 fronting, 90, 99, 225 frontolimbic, 281 frontostriatal, 279 function, 5, 6, 8, 14, 26, 29, 30, 32, 45, 57, 58, 59, 62, 64, 72, 82, 84, 130, 134, 135, 136, 138, 144, 149, 154, 165, 167, 173, 174, 177, 185, 188, 189, 190, 194, 204, 207, 211, 215, 226, 227, 244, 246, 253, 258, 261, 265, 267, 269, 270, 271, 275 functional, 2, 11, 26, 28, 29, 31, 42, 43, 47, 48, 50, 51, 53, 56, 57, 59, 81, 115, 135, 159, 164, 165, 167, 172, 174, 240, 241, 242, 249, 254, 258, 270, 276, 279 functionalist, 7, 240
G Gallo-Romance, 7 Gallus g. domesticus, 14 gaze, 10, 13, 14, 15, 16, 17, 18, 22, 24, 26, 28, 54, 74, 75, 84 gene, 69, 105, 160, 276 generation, 48, 61, 145, 152, 165, 169, 175, 255 generative, 63, 140, 155, 240, 241, 257, 262 generativity, 43, 48 genetic, 69, 275, 276, 279 genetically, 240 genetics, 9 German, 6, 144, 257, 280 gestural, 3, 4, 29, 31, 33, 38, 40, 45, 47, 48, 49, 53, 54, 55, 56, 57, 59, 60, 61, 64, 65, 70, 74, 75, 81, 84, 85, 86, 113, 125, 129, 130, 133, 136, 146, 147, 149, 150, 151, 152, 153, 155, 159, 161, 170, 172, 179, 180, 181, 182, 186, 188, 189, 190, 191, 193, 194, 195, 196, 198, 200, 202, 203, 205, 261, 262, 269, 271, 276, 277 gestural accompaniment, 125 gestural activities, 49 gestural acts, 48 gestural behavior, 75 gestural combination, 190 gestural communication, 29, 31, 33, 45, 47, 48, 49, 54, 55, 56, 61, 64, 65, 70, 74, 84, 86, 125, 136, 159, 161, 170, 172, 269, 271 gestural communicative system, 38, 40 gestural component, 81, 170, 262 gestural deixis, 179, 194, 198, 203 gestural element, 196 gestural iconicity, 153 gestural linguistic precursor, 152 gestural medium, 3, 133, 153 gestural modality, 180, 194, 203 gestural origin, 47, 48, 57, 60, 64, 130, 133, 146 gestural phylogeny, 153 gestural predication, 4 gestural reference, 3, 4 gestural repertoire, 40, 85 gestural signal, 45, 48, 53, 86 gestural subcomponent, 146 gestural system, 61, 129 gestural-vocal, 181, 182, 186, 191, 193, 200 gestural-vocal productions, 182 gestural-vocal system, 182 gesture, 3, 5, 6, 8, 29, 31, 32, 34, 38, 40, 41, 45, 47, 50, 54, 63, 64, 65, 67, 70, 71, 72, 73, 75, 84, 115, 119, 121, 123, 124, 125, 126, 128, 129, 130, 147, 149, 154, 180, 181, 182, 183, 184, 185, 186, 189, 191, 192, 193, 194, 199, 203, 204, 205, 213, 216, 261, 262, 263, 269, 275, 277, 281 gesturing, 3, 48, 73, 74, 80, 129, 180 gibbon, 165 gill arches, 135 gleeful vocalizations, 80 glottis, 94, 216 goal, 9, 29, 31, 32, 33, 34, 40, 72, 73, 99, 115, 119, 120, 160, 161, 239, 244, 258 goal-directed, 32, 160 goal-seeking, 115 gorilla, 45, 85, 165 Gorilla gorilla, 85 Gothic, 6 govern, 249 governance, 152 gradual, 31, 40 gradually, 50, 139, 254, 271 grammar, 6, 10, 11, 85, 155, 170, 183, 240, 241, 249, 257, 258, 275, 282 grammatical deixis, 239, 240, 241 grammaticalization, 3, 6, 9 granular, 163 grasp, 43, 44, 65, 112, 129, 130, 131, 156, 177 grasping, 31, 33, 38, 45, 49, 107, 111, 112, 118, 119, 123, 124, 149 great apes, 10, 43, 50, 56, 59, 63, 67, 83, 85, 112, 122, 173, 176 Greek, 1 grooming, 36, 115, 138, 274 ground, 2, 5, 14, 26, 30, 61, 124, 125, 232, 241, 278 grounding, 111, 115, 121, 270, 275
growth, 93, 94, 101, 103, 105, 207, 208, 210, 211, 233, 236, 237 guard, 13, 15, 17, 24 guidance, 65 gyrus, 11, 40, 117, 165, 174, 177, 257
hominins, 129, 130, 154, 170, 175 hominoidea, 67 Homo erectus, 110, 124, 135 Homo habilis, 110, 124, 163 Homo sapiens, 84, 85, 107, 109, 110, 112, 121, 124 homologous, 111, 118 H homologue, 56, 63 habitat, 26 homology, 29, 38, 42, 129 hair, 4 hostility, 271 hand, 3, 8, 31, 33, 34, 38, 40, 41, human, 1, 2, 3, 4, 5, 6, 7, 10, 29, 42, 43, 47, 49, 50, 54, 55, 56, 30, 31, 36, 38, 40, 42, 43, 44, 60, 61, 63, 64, 65, 71, 72, 73, 47, 48, 49, 50, 52, 53, 54, 56, 75, 79, 83, 84, 111, 112, 115, 57, 58, 59, 60, 62, 63, 64, 65, 118, 119, 120, 122, 123, 124, 66, 67, 70, 72, 73, 74, 75, 76, 126, 128, 129, 131, 149, 151, 77, 78, 79, 80, 81, 83, 84, 85, 157, 161, 171, 175, 183, 188, 86, 87, 88, 91, 93, 102, 103, 204, 210, 280 104, 105, 107, 109, 110, 111, handedness, 43, 48, 50, 54, 56, 112, 113, 115, 116, 117, 118, 120, 60, 61, 63, 64, 175, 269, 280 122, 125, 126, 128, 129, 131, handicapped, 87 151, 154, 155, 159, 161, 162, hatred, 270 163, 165, 168, 169, 170, 171, head, 10, 16, 17, 24, 49, 64, 89, 173, 174, 175, 176, 177, 180, 129, 130, 188, 241, 249, 250 194, 203, 209, 210, 232, 235, height, 14, 17, 93, 94, 207, 211, 236, 249, 253, 255, 256, 262, 213, 221, 225, 226 263, 267, 269, 271, 274, 277, Helmholtz resonator, 90 278, 279, 281 hemisphere, 48, 50, 51, 52, 55, humans, 1, 2, 3, 5, 9, 27, 30, 32, 56, 59, 60, 63, 64, 121, 164, 40, 42, 43, 47, 48, 49, 50, 51, 262, 265, 268, 270, 271, 274, 52, 55, 56, 57, 58, 60, 61, 65, 279, 281 66, 67, 69, 71, 72, 73, 74, 75, hemispheric, 48, 50, 52, 60, 63, 76, 77, 78, 79, 80, 81, 83, 85, 65, 66, 269 103, 104, 105, 109, 111, 113, heterochrony, 78 116, 117, 121, 122, 126, 128, hierarchical, 162, 170, 244, 249, 137, 149, 151, 159, 162, 163, 250, 252, 254 165, 166, 167, 169, 170, 173, hierarchy, 76 176, 262, 263, 270, 271, 273, history, 48, 74, 75, 76, 77, 79, 83, 275, 282 106, 109, 121, 126, 133, 134, 135, 141, 143, 155, 160, 163, I 263, 282 iconic, 45, 81, 124, 146, 153, holistic, 131, 137, 263, 274, 282 183, 189 holistically, 266, 274 iconics, 180 hominids, 4, 80, 104, 115, 144, idiom, 241, 254 145, 146, 152, 176 idiomatic, 274
image, 48, 105, 153, 209, 264 imagery, 44, 243 imaging, 8, 38, 42, 50, 61, 62, 64, 76, 103, 105, 146, 173, 175, 177 imitation, 3, 44, 45, 79, 84, 102, 104, 105, 107, 108, 111, 112, 118, 119, 120, 123, 124, 127, 129, 156, 161, 207, 209, 210, 216, 225, 226, 228, 229, 231, 235, 236 immature, 160 immaturity, 77, 81, 275, 279 immediate, 4, 17, 54, 70, 119, 135 impairment, 213, 236, 261, 275, 276, 279, 282 impedance, 137 imperative, 7, 57, 58, 72, 73, 74, 76, 79, 80, 171, 172 implant, 157 inability, 81, 102, 119, 151 index, 8, 56, 71, 72, 73, 76, 85, 86, 93, 94, 180, 183, 188, 189, 242, 248, 251, 253 indexical, 5 individual, 14, 15, 16, 17, 19, 21, 24, 26, 31, 32, 33, 34, 40, 53, 54, 56, 57, 60, 61, 71, 75, 94, 112, 117, 123, 133, 141, 145, 161, 165, 171, 176, 181, 191, 192, 194, 201, 202, 203, 265, 266, 267 infant, 4, 49, 58, 63, 64, 77, 78, 82, 83, 85, 105, 110, 112, 115, 117, 133, 136, 139, 140, 141, 142, 143, 144, 157, 168, 169, 170, 176, 219, 231, 244, 254, 256, 263, 273 inference, 227, 236, 238 infinite, 41 inflationist, 6 inflection, 4, 275 information structure, 239 ingestive, 29, 34, 43, 117, 118, 122, 123, 130, 145, 149, 150, 152, 154, 264
inhibit, 271 inhibition, 79, 176, 213, 269, 271 inhibitory, 172, 269 initial, 6, 114, 138, 141, 142, 149, 153, 201, 202, 207, 240, 245, 267 initiate, 151 initiating, 171, 172 initiation, 6, 117, 142, 159, 168, 169, 172, 265, 268, 269 innate, 117, 169, 240 in-phase, 8 input, 8, 34, 96, 97, 123, 150, 152, 160, 169, 207, 211, 212, 213, 242, 244, 246, 247, 248, 249, 269 inspiration, 113 instrument, 170 instrumental, 170 integrate, 214, 232 integration, 11, 111, 120, 163, 167, 170, 203, 208, 209, 237 integrative, 172 intelligence, 3, 10, 65, 78, 82, 85, 105, 162, 235 intelligent, 6, 236 intentional, 2, 30, 31, 42, 47, 49, 50, 51, 53, 54, 55, 56, 59, 60, 61, 67, 70, 78, 82, 83, 84, 85, 128, 159, 160, 161, 162, 170, 172, 175, 266 intentionality, 3, 159, 160, 171, 173, 270 interaction, 33, 69, 76, 77, 78, 111, 119, 125, 131, 136, 143, 172, 175, 207, 208, 209, 241, 244, 245, 262, 282 interconnection, 167 interconnectivity, 167, 168, 175 interface, 173, 176 intermediary, 108 intermediate, 94, 115, 226, 227, 272 internal, 33, 112, 120, 135, 156, 160, 216, 229, 248, 266, 270
interpret, 52 interpretation, 75, 79, 103, 172, 180, 183, 249, 256 interrogative, 57, 58 intertwined, 117, 118, 119 interweaving, 9 intonation, 6, 8, 266, 267 intonational, 8 intrinsic, 94, 146, 151, 152 inversion, 11, 105, 218, 221, 226, 227, 228, 229, 234, 235 involuntarily, 145 ion, 276 irritative lesions, 145 Italian, 42, 181, 183, 188, 204, 206, 280 J Japanese, 51, 65, 84, 104, 114, 126, 254, 256 Japanese macaque, 51, 65 jaw, 36, 87, 94, 101, 115, 136, 137, 207, 210, 211, 213, 214, 221, 225, 237 jaw closure, 36 jaw opening, 36 joint attention, 81, 82, 83, 84, 159, 171, 172, 173, 176, 240 juvenile, 69, 79, 143 K Kanzi, 58, 73, 115 kin terms, 156 kinesthetic, 112 L labial, 139, 140, 141, 234 LAD [language acquisition device], 240 lamprey, 148 Landzert angle, 89 language, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 14, 27, 29, 30, 31, 32, 34, 40, 41, 42, 43, 44, 47, 48, 49, 50, 53, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 71, 73,
76, 78, 80, 81, 82, 83, 85, 86, 103, 104, 106, 107, 108, 109, 110, 111, 113, 114, 115, 116, 117, 118, 119, 121, 122, 125, 126, 128, 129, 130, 131, 133, 135, 136, 137, 138, 139, 141, 144, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 159, 161, 167, 169, 170, 171, 174, 175, 176, 180, 181, 188, 199, 203, 204, 205, 206, 208, 209, 216, 225, 232, 233, 234, 235, 237, 238, 239, 240, 241, 242, 243, 244, 249, 253, 254, 255, 256, 257, 258, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283 language acquisition, 11, 83, 157, 204, 205, 239, 240, 241, 244, 253, 256, 258 language acquisition device [LAD], 240 language development, 82, 171, 199, 203, 206, 258, 281 language evolution, 2, 3, 6, 8, 10, 29, 30, 31, 40, 41, 47, 107, 110, 113, 114, 130, 153, 262, 271 language origin, 3, 61, 64, 152, 155 language readiness, 107, 109, 119, 122,150 language ready, 109, 135 laryngeal, 8, 38, 44, 93, 94, 115 larynx, 87, 88, 89, 91, 93, 94, 102, 103, 104, 116, 177, 207, 211 late, 78, 82, 114, 135, 142 later, 80, 81, 114, 137, 161, 170, 171, 172, 193, 194, 196, 198, 203, 210, 273 laterality, 52, 54, 60, 61, 65, 66 lateralization, 48, 52, 59, 64, 65, 66
laughter, 58, 268, 281 layer, 163, 165, 166, 167 learning, 31, 65, 85, 86, 130, 131, 144, 145, 180, 206, 207, 209, 210, 216, 218, 227, 228, 229, 231, 237, 240, 242, 244, 245, 246, 247, 248, 249, 253, 255, 256, 257 left, 8, 11, 40, 48, 49, 50, 51, 52, 55, 56, 59, 60, 61, 63, 64, 121, 165, 170, 171, 257, 261, 262, 264, 265, 269, 271, 274, 279 leopard, 62 lesion, 51, 170, 261, 262, 276, 282 level, 3, 14, 15, 18, 26, 28, 32, 73, 75, 76, 78, 113, 114, 115, 136, 137, 138, 139, 141, 151, 159, 191, 194, 195, 209, 210, 237, 245, 252, 254, 274 lexical, 6, 141, 145, 146, 152, 153, 156, 239, 241, 242, 245, 246, 250, 252, 254, 256, 257, 258, 262, 264, 267, 270, 272, 276 lexical categorization, 239, 242 lexical openness, 145, 146, 152 lexical speech automatism, 264, 270 lexicon, 70, 142, 242, 271, 281, 282 lexigrams, 58, 115 licking, 34, 115, 135, 145, 264 limbic, 4, 44, 162, 163, 164, 168, 170, 174, 175, 268, 273, 274, 281 linear, 94, 211, 215, 242 linguistic, 2, 3, 4, 5, 8, 53, 57, 63, 81, 111, 112, 128, 129, 143, 146, 147, 151, 153, 156, 157, 159, 162, 167, 169, 173, 180, 185, 188, 205, 206, 232, 233, 241, 244, 254, 255, 256, 264, 267, 268, 273, 279, 281 lip, 93, 94, 99, 116, 122, 123, 139, 149, 150, 207, 211, 213, 216, 221, 224, 225, 226, 230, 236
listening, 30, 34, 114 localization, 13, 14, 26, 111, 177, 237 localize, 2, 9, 13 location, 4, 5, 6, 13, 14, 16, 18, 19, 22, 24, 26, 67, 74, 85, 89, 147, 188 locative, 1, 6, 7, 188 locomotion, 77, 80, 115, 148, 154 locomotor, 77, 78, 81 locus, 146 logic, 167, 258 logical, 135 longitudinal, 45, 94, 172, 181, 182, 183, 186, 196, 201, 202, 203, 207, 212, 277 longitudinally, 179, 182 look, 7, 19, 22, 24, 26, 32, 57, 65, 74, 146, 188, 191, 200, 267, 270 LSA [lexical speech automatism], 265, 266, 267, 269, 270, 271, 273, 275, 278 lust, 271 M Macaca fuscata, 65 Macaca mulatta, 14, 28, 44, 63 Macaca nemestrina, 44 macrostructural, 159, 164 male, 42, 88, 94, 101, 106, 142, 212 mammalian, 3, 28, 113, 135, 165, 174, 264 mand, 58 mandible, 134, 135, 148, 209, 221, 264 mandibular, 7, 38, 115, 116, 135, 136, 137, 138, 143, 144, 145, 148, 152, 264 manipulation, 13, 65, 81, 82, 85, 159 manual, 31, 40, 47, 48, 49, 53, 54, 55, 56, 64, 67, 69, 76, 79, 80, 81, 109, 110, 111, 112, 113,
114, 116, 118, 122, 123, 124, 125, 126, 127, 128, 129, 136, 138, 139, 145, 147, 148, 149, 150, 151, 152, 154, 157, 170, 189, 263 manual deixis, 67, 69, 79, 80, 81 manual gesture, 40, 47, 53, 54, 64, 67, 76, 81, 114, 116, 122, 123, 124, 126, 128, 139, 148, 150, 170, 189, 263 map, 56, 112, 219, 226, 227, 228, 231, 248, 250, 251 mapping, 176, 224, 240, 242, 244, 245, 246, 247, 248, 251, 252, 253, 255, 256, 276 master controller, 134 match, 33, 85, 101, 126, 209, 227, 228, 231, 248 matching, 33, 41, 92, 137 maximize, 97, 217 maximum, 201, 217 meaning, 3, 10, 16, 28, 31, 32, 34, 35, 41, 75, 111, 113, 117, 119, 120, 121, 123, 124, 125, 127, 133, 137, 142, 146, 147, 153, 155, 170, 171, 177, 180, 181, 183, 184, 189, 203, 217, 237, 239, 240, 241, 242, 244, 246, 248, 249, 250, 251, 252, 253, 254, 255, 257, 265, 266, 267, 270, 280 meaningful, 29, 32, 42, 50, 51, 70, 75, 112, 168, 183, 203 mechanics, 75 mechanism, 41, 42, 57, 161, 240, 241, 248, 269 medial, 56, 152, 163, 164, 171, 174, 176, 177 medial frontal, 171, 176, medial prefrontal, 165, 174, 177, mediodorsal, 162, 163 mediofrontal, 117, 168 medium, 3, 16, 17, 18, 19, 22, 24, 26, 133, 135, 137, 146, 152, 153
meerkat, 1, 2, 3, 4, 9, 13, 14, 15, 16, 17, 18, 24, 26, 28 MEG, 40 memory, 8, 10, 33, 159, 245, 247, 248, 251, 262, 269, 281 mental imagery, 38 mental rotation, 38 mental state, 57, 75, 160, 171 mentalizing, 170, 175 mesial, 30, 38, 117 metaphor, 128, 147, 162 metaphorical, 274 metaphorics, 180 metrical, 7, 113 mice, 60 microstructural, 159 midsagittal, 94, 99, 210, 211 migratory, 166 mimesis, 118, 123, 135, 151 mind, 2, 10, 43, 44, 47, 63, 65, 73, 82, 83, 85, 119, 129, 130, 131, 154, 156, 273 mirror, 29, 33, 34, 38, 40, 41, 42, 44, 47, 48, 56, 64, 107, 111, 112, 113, 117, 118, 119, 120, 122, 123, 124, 126, 129, 130, 145, 146, 149, 150, 151, 152, 153, 155, 161, 162, 264, 280 mirror neurons, 29, 33, 34, 38, 41, 44, 47, 48, 56, 64, 113, 122, 123, 130, 145, 146, 149, 150, 151, 152, 153, 154, 155, 161 mirror system, 40, 41, 42, 107, 111, 112, 113, 117, 118, 120, 124, 126, 129, 161, 280 missing link, 4 mob, 5 mobbing, 5 modality, 34, 57, 58, 118, 138, 179, 185, 189, 190, 202, 203, 205, 210 modification, 4, 134, 135, 148 modular, 277 modulated, 62, 113, 115 modulation, 60, 62, 115, 119 module, 212, 274
mole snake, 4 monitor, 54, 177 monitoring, 75, 165, 175, 279 monkey, 3, 28, 29, 30, 33, 34, 35, 38, 40, 41, 42, 43, 45, 48, 51, 56, 60, 62, 63, 65, 84, 111, 112, 116, 117, 118, 119, 123, 129, 130, 149, 150, 151, 154, 167, 174, 177 morphological, 87, 102, 212, 265 morphology, 87, 259, 262, 263 mother, 5, 73, 79, 117, 142, 154, 168, 169, 174, 268 motherese, 7, 125, 129, 130, 154, 169, 170, 175 motion, 33, 188 motivation, 74, 170, 171, 174, 216 motivational, 14, 70, 75, 80 motor, 8, 11, 29, 32, 33, 34, 40, 41, 43, 44, 45, 47, 50, 52, 54, 61, 64, 112, 115, 116, 117, 120, 121, 122, 123, 135, 142, 146, 149, 150, 151, 154, 156, 159, 160, 161, 163, 164, 166, 169, 170, 174, 176, 207, 208, 209, 210, 214, 216, 218, 220, 221, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 235, 236, 237, 264, 269, 270, 275, 276, 281 mouth, 8, 29, 33, 34, 38, 40, 41, 43, 49, 50, 61, 63, 64, 65, 83, 116, 123, 124, 128, 129, 130, 134, 138, 140, 149, 154, 175, 183, 188, 204, 214, 280 mouthing, 128 move, 14, 22, 24, 26, 91, 126, 210 movement, 15, 19, 32, 38, 77, 115, 119, 121, 126, 131, 140, 144, 147, 149, 163, 183, 188, 214, 262 MRI, 11, 50, 56, 104, 174
multimodal, 81, 125, 128, 170, 180, 203, 208, 209 multimodality, 207, 208, 229
O O [Object], 241, 242, 244, 245, 249, 259 object, 2, 7, 14, 26, 27, 30, 31, 32, N 33, 34, 45, 49, 53, 54, 57, 58, Nahuatl, 6 65, 67, 70, 71, 72, 73, 74, 75, naming, 80, 81, 143, 156, 190, 76, 77, 78, 79, 80, 81, 82, 85, 203 108, 112, 119, 122, 146, 151, narrate, 253 160, 171, 183, 186, 188, 214, narrative, 180 244, 245, 253, 256, 258, 260, nasal, 93, 94, 113, 142, 143, 154, 274, 281 156, 236 object [O], 241, 242, 244, 245, nascent, 9, 73, 75 249, 259 Naturphilosophie, 144 objecthood, 2 Neandertal, 9, 87, 88, 89, 91, 93, observation, 29, 30, 33, 34, 40, 103, 104, 105 41, 42, 43, 79, 111, 119, 121, neotenous, 142 122, 123, 130, 149, 154, 192, neural, 5, 8, 9, 10, 29, 30, 32, 34, 194, 195, 198, 201, 209, 252, 43, 44, 57, 59, 60, 61, 62, 102, 271, 275 109, 111, 116, 117, 120, 121, observe, 33, 34, 40, 49, 59, 69, 123, 127, 129, 131, 136, 138, 101, 114, 120, 201, 249, 264 155, 159, 160, 161, 162, 171, observer, 34, 41, 67, 74, 119, 120 174, 177, 212, 232, 234, 263, Occitan, 7 267, 268, 269, 273, 275, 276, offset, 212 278, 280, 281, 283 Old World monkeys, 165 neuroethological, 30 omni-directional, 125, 147 New World monkeys, 165 on-line, 121, 145, 252, 255, 256 newborn, 65, 91, 94, 102, 104, onset, 78, 113, 114, 143, 174, 181, 212, 225 184, 193, 205, 212, 219 NLSA [nonlexical speech ontogenesis, 103, 106 automatism], 266, 267, 269 ontogenetic, 49, 58, 77, 81, 105, nonallometric, 162 143, 144, 170 nonfluent, 267, 270, 280 ontogeny, 12, 45, 49, 67, 83, 85, nonhuman primates, 30, 47, 104, 115, 137, 143, 144, 145, 48, 50, 51, 52, 56, 57, 58, 59, 153, 208, 210, 232, 257, 262 62, 65, 66, 83, 113, 116, 117, open, 8, 38, 90, 107, 108, 112, 121, 138, 156 113, 116, 118, 120, 122, 123, nonhumans, 57 124, 125, 126, 127, 134, 137, nonlexical speech automa138, 146, 152, 153, 220, 221, tism, 264, 266 241, 242, 243, 244, 245, 247, nonsensical, 70 252, 253, 256, 264 nonverbal reference, 67, 70 open class, 107, 122, 123, 124, nose, 151 137, 138, 146, 152, 153, 241, nucleus, 113, 162, 169, 212, 213, 242, 243, 244, 245, 246, 247, 237 248, 250, 252, 253, 256 open-close, 38, 113, 221
open-ended, 107, 108, 120, 146 open repertoire, 107, 112, 118, 125, 127 opening, 36, 49, 113, 183, 188 openness, 117, 124, 145, 146, 147, 152, 153 optimal, 33, 97, 233 oral, 49, 50, 87, 88, 93, 98, 101, 104, 116, 117, 125, 134, 149, 150, 153, 154, 205, 214, 234, 235 orangutan, 73, 85, 165 orbital, 163 order, 6, 108, 137, 138, 159, 161, 162, 241, 246, 247, 248, 251, 253, 254, 263, 264, 273 orient, 51, 52, 63, 71, 76 orientation, 10, 64, 85 origin of language, 50, 56, 61, 66 origin of speech, 48, 60, 104 oro-facial, 8, 31, 38, 40, 41, 42, 117, 121, 122, 124 orosensorial, 207, 214 oscillation, 115, 135, 136, 137, 138, 148 output, 41, 96, 115, 136, 141, 142, 150, 151, 152, 156, 160, 232 overt, 6, 75, 209 P palate, 90, 94, 207, 214, 215 pale chanting goshawk, 4 paleomammalian, 162 Pan paniscus, 73, 86 Pan troglodytes, 10, 27, 64, 84, 85 panic call, 15, 19, 22, 26 pantomime, 117, 119, 120, 121, 124, 126, 146, 151 paracingulate, 164, 175 paralimbic, 164 parameter, 93, 94, 96, 211, 219, 245 parceling, 151 parent, 125, 133, 136, 142, 143
parental, 133, 142, 143, 153, 156, 170 parietal, 8, 32, 42, 45, 111, 115, 165, 169, 269 particulate, 112 partner, 53, 54, 57, 58, 70, 72, 73, 74, 75, 81, 136, 271 passive, 241, 249, 250, 251, 259 pathway, 42 patient, 121, 145, 151, 170, 184, 214, 236, 239, 278, 282 pattern, 7, 63, 78, 117, 140, 141, 142, 148, 168, 171, 192, 193, 194, 198, 199, 200, 201, 202, 203, 229, 252, 253, 255, 265, 276, 280 perception, 32, 43, 47, 48, 51, 52, 56, 59, 62, 63, 105, 107, 108, 111, 131, 136, 168, 177, 207, 208, 210, 213, 214, 216, 232, 234, 235, 236, 237, 238, 242, 256, 258, 262, 263, 270 perceptual, 41, 104, 120, 131, 157, 208, 214, 216, 218, 226, 227, 229, 232, 235, 239, 242, 244, 253, 258, 269, 275 peripheral, 159 peri-sylvian, 275 PET, 44, 171 PF [parietal], 111 pharynx, 87, 93, 94, 101, 102, 212 phase, 8, 113, 136, 137, 138, 152, 210, 227, 228 phasic, 207, 212 phonatory, 41, 42, 169 phonetic, 136, 153, 154, 211, 216, 218, 219, 221, 231, 233, 235, 262 phonetics, 103, 106, 130, 219, 234, 235 phonological form, 108, 111, 120, 134 phonological loop, 8 phrasal semantics, 239, 240, 241, 245, 246, 249, 254, 255
phylogenesis, 103, 106 phylogenetically, 162, 171, 209 phylogeny, 85, 115, 137, 143, 144, 145, 148, 153, 155, 162, 208, 210, 232, 257, 262, 280 Piagetian, 78, 82 pidgin, 263 pigtail macaques, 44 place, 148, 155, planning, 159, 160, 169, 277, 281 planum temporale, 50, 63 play, 6, 79, 85, 109, 168, 186 playback, 13, 15, 17, 18, 22, 24 played back, 18 PLD [primary linguistic data], 240 point, 5, 7, 9, 67, 68, 69, 71, 72, 74, 75, 76, 78, 79, 81, 84, 85, 176, 180, 183, 184, 185, 187, 188, 190, 200 pointing, 1, 3, 4, 5, 7, 8, 9, 10, 53, 58, 59, 62, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 176, 180, 185, 189, 205 Pongo pygmaeus, 70, 83 positron, 42, 43, 167 possessive, 188 posture, 13, 15, 16, 17, 26 potential, 52, 87, 96, 103, 123, 126, 173, 176, 225, 226, 237, 271 pout face, 41 pragmatic, 6, 239, 241, 249, 265, 274 prairie dog, 2, 14, 28 praxic, 119, 120 praxis, 126 prayer, 265 pre-adaptation, 29, 48, 87, 102 pre-babbler, 226 precursor, 29, 30, 38, 40, 57, 110, 122, 124, 125, 135, 138, 145, 146, 149, 152, 232 predator, 4, 5, 11, 13, 14, 15, 16, 17, 18, 19, 22, 26, 28, 30, 62
predator approaching, 18, 19, 22, 26, 30 predicate, 3, 5, 6, 7, 10, 183, 184, 185, 203, 239, 244, 253 preference, 64, 65 preferential, 49, 54, 55 preferentially, 54, 194 prefrontal, 162, 163, 165, 167, 169, 174, 177, 268, 269 prelimbic, 175 prelinguistic, 4, 57, 58, 84, 159, 170, 171 premotor, 32, 36, 38, 40, 42, 43, 45, 56, 63, 111, 115, 118, 123, 130, 154, 165, 169, 176, 269 prerequisites, 47, 75, 87, 102, 159, 171 pre-speech, 133, 147 pressure, 2, 3, 6, 26, 113, 117, 125, 137 preverbal, 70, 73, 83, 86 primary linguistic data [PLD], 240 primates, 2, 5, 30, 31, 36, 43, 44, 45, 47, 48, 52, 53, 59, 62, 66, 83, 102, 104, 116, 121, 122, 145, 157, 161, 163, 168, 174, 262, 282 primitive, 30, 32, 40, 41, 59, 124, 244, 267, 271, 275, 278 probabilistic, 226, 227, 238 probability, 54, 217, 218, 221, 226, 227, 235, 272 problem-solving, 80, 166 process, 34, 40, 41, 51, 93, 94, 107, 109, 119, 124, 126, 135, 136, 137, 138, 139, 182, 207, 212, 216, 231, 232, 264, 274, 282 processing, 44, 49, 50, 51, 52, 60, 61, 62, 63, 66, 126, 135, 159, 163, 167, 169, 173, 206, 208, 212, 234, 237, 239, 240, 247, 249, 250, 254, 255, 256, 257, 258, 262, 265, 268, 274, 276
procreation, 137
production, 11, 28, 31, 32, 40, 42, 44, 47, 48, 49, 51, 52, 54, 55, 56, 59, 60, 61, 79, 96, 105, 106, 107, 108, 111, 113, 114, 115, 120, 121, 129, 130, 136, 137, 139, 141, 143, 144, 146, 148, 149, 152, 154, 155, 157, 160, 169, 170, 175, 179, 181, 183, 186, 188, 192, 193, 194, 195, 196, 198, 199, 200, 201, 203, 204, 205, 208, 210, 214, 224, 232, 233, 234, 236, 257, 266, 267, 268, 269, 271, 272, 274, 277, 279, 280, 281, 282
productive, 27, 142, 194, 199, 241, 242, 253
program, 62, 102, 139, 208, 209, 233
programming, 133, 207, 216, 235, 261, 277, 281
pronoun, 241, 251, 252, 254, 261, 270, 272, 273, 278
propagating, 170
propagation, 126
property, 29, 32, 33, 36, 40, 45, 52, 70, 96, 117, 136, 144, 146, 150, 164, 168, 236, 244, 264
propositional, 264, 265, 268, 270
propositionality, 265
prosimians, 165
prosodic, 7, 11, 257
prosody, 170, 175, 243, 277, 280
protective, 137
protein, 167, 176
proto-command, 6
proto-language, 3, 4, 6
protosign, 107, 110, 111, 112, 117, 118, 120, 121, 122, 123, 124, 125, 126, 127, 128, 146, 152, 161
proto-speech, 9, 107, 110, 111, 112, 117, 118, 121, 122, 123, 124, 125, 126, 127, 128, 147, 149, 150, 152, 153, 161, 263, 274
proto-vocal apparatus, 123
protoword, 110
protrusion, 35, 38, 93, 94, 99, 102, 122, 124, 149, 150, 207, 211, 213
proximal, 7, 33, 43, 135, 136, 138, 152, 154
putamen, 169, 269
pyramidal, 117, 165, 167, 169, 176
python, 4

Q
quadrupedal, 77

R
racism, 270
rat, 167, 168, 175, 177
rational, 161
reaching, 7, 34, 55, 56, 62, 82, 124, 149
reach-outs, 72
reaction, 4, 126, 168
reactive, 160
reactively, 270
readiness, 113
ready, 107, 109, 110, 121
reared, 54, 76
rearing, 74, 76, 77, 79
reasoning, 75, 217, 236, 238, 257, 282
recapitulate, 144
recapitulation, 143
recapitulationist, 81, 143
receiver, 24, 34, 136
reception, 61
receptor, 177
recipient, 124, 146, 244, 259, 260
reciprocal, 163
recitation, 264, 265
recognition, 29, 32, 33, 41, 42, 43, 45, 57, 60, 63, 111, 120, 129, 156, 174, 234, 253, 278
recognize, 27, 29, 32, 41, 52, 73, 75, 108, 110, 111, 118, 163
recovery, 43, 94, 213, 261, 264, 267, 276, 280
recruit, 8
recruiting, 3, 8
recruitment, 5, 13, 27, 63, 163
recurrent utterance, 261, 279
recursion, 1, 256, 270
re-direct, 74, 81, 152
re-direction, 79
refer, 5, 13, 14, 62, 70, 73, 110, 170, 184, 265
reference, 5, 10, 28, 67, 70, 85, 88, 91, 94, 189, 211, 216, 226, 241, 251
referential alarm call, 13, 15, 26
reflex, 168
reflexive, 143, 241, 251, 252, 254, 260
regular, 193
regularity, 198, 240, 245
regulation, 135, 165, 166, 171, 176
reinforce, 49, 58
relative (clause), 241, 249, 250, 251, 252, 254, 259
relative(s), 30, 41, 81, 138
relativizer, 1
release, 145
religious, 271
remnant, 9
repertoire, 35, 41, 107, 112, 113, 118, 120, 121, 124, 125, 127
report, 55, 60, 73, 115
representation, 11, 33, 34, 38, 41, 44, 56, 64, 96, 112, 120, 130, 154, 155, 160, 174, 210, 214, 231, 233, 236, 239, 241, 244, 246, 248, 249, 250, 252, 257, 271, 275, 277, 283
representational skills, 131, 179
reptilian, 3, 135, 162
resonance, 50, 90, 103, 105
respiration, 113, 116
respiratory, 104
response, 5, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 33, 34, 35, 37, 49, 52, 55, 58, 70, 75, 79, 85, 104, 119, 120, 135, 136, 137, 151, 160, 164, 168, 171, 172, 226, 235
responsive, 123
responsiveness, 42
retrieval, 78, 79, 81, 248
rhesus macaque, 14, 28, 52
rhythm, 138, 148
rhythmic, 7, 115, 134, 136, 137, 138, 143, 154, 210
right, 8, 43, 49, 50, 51, 52, 54, 55, 56, 59, 60, 61, 63, 64, 165, 175, 262, 265, 268, 269, 270, 274, 279, 280, 281, 282, 283
rime, 113
ritualization, 38, 77
robot, 207, 208, 210, 211, 212, 213, 216, 217, 218, 225, 226, 227, 228, 230, 231, 232, 234, 235, 237, 253, 256, 258
robotic, 218, 231, 232, 242, 253
robotics, 11, 96, 131, 207, 208, 209, 210, 226
role, 31, 32, 34, 38, 43, 47, 49, 63, 65, 82, 84, 86, 104, 109, 113, 114, 116, 117, 118, 121, 128, 134, 136, 143, 144, 145, 146, 149, 152, 154, 160, 163, 167, 168, 169, 170, 172, 174, 176, 177, 179, 185, 196, 199, 202, 206, 212, 221, 225, 229, 235, 236, 254, 256, 257, 268, 271, 273, 276, 279
root, 91, 94, 102, 115, 148
rostral, 163, 165
rough-and-tumble play, 6
route, 40, 116, 119, 232, 271, 275
routine, 128, 173, 263

S
sagittal, 207, 211, 226
salience, 40, 165
salient, 190
Saussurean, 3
scaffolding, 107, 110, 112, 117, 118, 120, 121, 123, 124, 125, 127, 128, 146
scan, 3, 24, 26
scene analysis, 2, 239
schema, 111, 120
schemata, 269
scheme, 127, 186
scratch, 41, 121, 144
seat, 146
secondary, 26, 78, 81, 212
segment, 105, 277
selection, 26, 61, 81, 128, 134, 135, 136, 137, 142, 170, 173, 225, 259
self, 104, 141, 145, 151, 152, 159, 166, 171, 175, 176, 226, 234, 237, 269, 271, 273, 278
self-arouse, 271
self-awareness, 159, 273
self-consciousness, 273, 278
self-control, 166
self-generated, 145, 151
self-initiated, 152, 175, 269
self-organization, 141, 234, 237
self-perspective, 171
self-regulation, 171, 176
semantic, 2, 6, 53, 62, 63, 65, 70, 108, 111, 120, 121, 127, 159, 161, 162, 169, 170, 173, 184, 188, 191, 244, 247, 250, 251, 254, 266, 267, 272, 281
semantic form, 108, 111, 127
semantic-conceptual, 2
semantics, 10, 112, 117, 177, 239, 240, 241, 245, 246, 249, 250, 252, 254, 255, 258, 262, 276
semiotic, 7, 86, 130
sender, 34, 136
sensibilities, 271
sensitive, 8, 72, 114, 149, 150, 214
sensitivity, 114, 150, 214, 258
sensorimotor, 50, 78, 85, 281, 282
sentence, 114, 239, 240, 241, 242, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 263, 271
sequence, 115, 118, 139, 140, 143, 144, 145, 148, 191, 250, 253, 254, 257, 258, 272
serial, 138, 139, 141, 144, 154, 156, 157, 256, 264, 265
serially, 137
sex, 86, 104, 270, 271, 275
sexual, 271, 274
shape, 7, 9, 44, 71, 88, 89, 93, 94, 104, 119, 135, 137, 138, 207, 208, 209, 210, 211, 224, 226, 231
share, 57, 58, 73, 74, 78, 85, 102, 124, 133, 148, 172, 263, 276
shared attention, 73, 80
sharing, 58, 72, 78, 80, 167, 171
shelter, 3, 15, 17, 19, 22, 24, 26
shift, 52, 61, 116, 128, 142, 161
sign, 9, 49, 63, 73, 83, 109, 110, 111, 120, 121, 125, 128, 129, 130, 131, 133, 146, 147, 148, 149, 150, 153, 157, 163, 171
sign language, 49, 63, 73, 109, 110, 121, 128, 129, 130, 133, 147, 148
signal, 11, 27, 52, 58, 60, 62, 133, 142, 146, 153, 233, 243, 267, 274
signaler, 30, 53, 70, 73, 74, 75
signaling, 77, 80
signed language, 44, 49, 108, 117, 125, 126, 128, 147, 153
signing, 85, 107, 108, 125, 126, 128
simulation, 91, 94, 98, 209, 225, 243, 255, 256, 257
simultaneous, 34, 49, 81
simultaneously, 8, 49, 55, 80, 118, 152, 253
situation, 2, 15, 16, 18, 19, 22, 24, 27, 108, 125, 141, 150, 239, 265
skill, 119, 124, 144, 157, 176
skull, 88, 89, 93
SMA [supplementary motor area], 145, 146, 151, 152, 269, 274, 278
smiling, 80
snake, 4, 5, 27
social cognition, 79, 86, 159, 160, 170, 171, 177
somatic, 166
somatosensory, 112, 169, 226
somesthesic, 214, 236
sound, 2, 29, 30, 34, 35, 38, 39, 40, 41, 49, 51, 56, 60, 73, 95, 101, 111, 115, 118, 122, 123, 126, 128, 130, 136, 137, 142, 145, 147, 150, 153, 168, 170, 188, 209, 210, 211, 212, 213, 216, 219, 221, 222, 223, 225, 226, 229, 230, 231, 234, 235, 264, 274, 277
source, 13, 26, 144, 244
South Africa, 16, 27
special, 11, 74, 237, 238, 268
specialized, 51, 58, 70, 107, 128, 171, 174, 212
speech, 1, 2, 6, 8, 9, 10, 11, 12, 14, 29, 30, 38, 40, 41, 42, 43, 44, 47, 48, 49, 50, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 73, 82, 87, 88, 92, 96, 97, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 114, 115, 116, 117, 118, 121, 124, 125, 126, 127, 128, 129, 131, 133, 134, 135, 136, 137, 138, 139, 140, 141, 143, 144, 145, 146, 147, 148, 150, 153, 154, 155, 156, 157, 159, 165, 169, 170, 175, 179, 180, 181, 183, 184, 188, 189, 203, 204, 205, 207, 208, 209, 212, 213, 214, 219, 224, 225, 231, 232, 233, 234, 235, 236, 237, 243, 253, 257, 258, 261, 262, 263, 264, 265, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 279, 280, 281, 282, 283
speech abilities, 87
speech acquisition, 139, 140, 157, 207, 208, 232
speech act, 6, 270, 276
speech apparatus, 134
speech apraxia, 276
speech area, 40
speech automatism, 264, 267, 273, 279
speech capabilities, 96
speech circuit, 57
speech control, 8, 59, 213, 214, 237
speech deixis, 8
speech development, 11, 169, 208, 209, 224, 225, 231, 236, 237
speech domain, 114
speech error research, 139
speech evolution, 11, 116
speech forms, 10, 11
speech frame, 115, 264
speech games, 232
speech gestures, 208, 212, 213, 214, 237
speech motor control, 12
speech movements, 154, 233, 235
speech ontogeny, 139, 143, 156, 281
speech output control, 146
speech output organization, 146
speech patterns, 155
speech perception, 11, 52, 63, 65, 231, 234, 237
speech phylogeny, 139, 232
speech planning, 261, 277
speech processing, 57
speech production, 11, 12, 38, 48, 88, 92, 97, 103, 113, 144, 145, 146, 154, 155, 157, 212, 234, 235, 236, 237, 268, 270, 277, 281
speech robotics, 11, 207, 208, 237, 238
speech signal, 212, 236, 243
speech sounds, 103, 170, 209, 212, 213, 219, 225, 231, 264, 274
speech-action production, 262
speech-action system, 274
speech-based communication, 170
speech-like behavior, 144
speech-like utterances, 144
speech-like vocalizations, 143
speech-related areas, 60
speech-related capacities, 136
speech-related output, 139
speed, 14, 22, 26, 123
spindle, 165, 166, 168, 169, 175
spiral, 127, 147
spoken language, 34, 42, 113, 114, 116, 125, 126, 128, 133, 146, 152, 155, 157, 180, 206, 265, 272
spontaneous, 48, 49, 53, 54, 79, 86, 145, 151, 170, 179, 181, 280
stabil-loop, 8
stage, 3, 4, 6, 22, 41, 112, 135, 138, 143, 145, 146, 149, 152, 153, 179, 192, 193, 203, 209, 216, 221, 224, 225, 263, 265, 267, 269, 272, 273, 274, 277, 278
stand for, 53, 70
state, 10, 14, 30, 48, 70, 73, 75, 85, 124, 138, 149, 226, 227, 234, 239, 240, 244, 255, 266
step, 8, 9, 40, 41, 42, 137, 138, 151, 169, 183, 209, 216, 217, 228, 279
stimulated, 3, 54, 151, 168
stimulation, 145, 151, 269
stimulus, 13, 15, 33, 71, 160, 169, 212, 226
stream, 253
streamlining, 2, 7
stress, 111, 117, 121, 125, 150, 166, 266
striatum, 174
stroke, 7, 43, 276
subject [S], 6, 242
Subject Verb Object [SVO], 242
subordinate, 5
subordination, 162
sucking, 34, 49, 115, 122, 123, 135, 145, 149, 150, 264
suckling, 225
sulcus, 165
supplementary, 117, 151, 156, 157, 165, 168, 174, 184, 185, 190, 196, 198, 199, 201, 203, 268, 269
supplementary motor area [SMA], 117, 151, 156, 157, 165, 168, 268, 269
supramarginal, 9
surface structure, 266, 267
Suricata suricatta, 13, 14, 27, 28
survival, 4, 6, 121, 136, 137, 166
swearing, 265, 270, 282
syllable, 7, 10, 113, 114, 115, 133, 134, 138, 139, 140, 141, 143, 144, 147, 148, 153, 219, 264, 267, 269, 274, 278, 281
syllable onset, 113, 114
SVO [Subject Verb Object], 242
Sylvian, 51
symbol, 3, 30, 65, 70, 85, 126, 142, 159, 185, 205, 244
symbolic, 3, 7, 53, 67, 70, 73, 81, 83, 85, 121, 155, 181, 183, 185, 188, 191, 198
symptomatology, 262
synchronous, 191, 200, 201, 202
syndrome, 268, 281
syntactic, 1, 8, 11, 110, 139, 159, 161, 169, 240, 242, 245, 247, 249, 251, 255, 256, 257, 258, 262, 263, 265, 271, 275, 276
syntax, 8, 9, 10, 82, 107, 110, 112, 139, 175, 232, 249, 254, 255, 256, 257, 258, 262, 263, 269, 270, 272, 273, 274, 275, 276, 277, 278, 280
system, 4, 6, 22, 29, 31, 32, 34, 36, 38, 40, 41, 42, 44, 58, 59, 94, 104, 105, 107, 110, 111, 112, 118, 120, 121, 123, 124, 125, 126, 127, 130, 131, 135, 136, 137, 138, 144, 145, 147, 151, 152, 153, 157, 162, 163, 168, 172, 177, 180, 188, 207, 208, 209, 212, 214, 216, 218, 225, 226, 230, 232, 240, 242, 245, 252, 253, 254, 255, 264, 268, 269, 273, 277, 279, 281, 282
T
tactics, 81
talk, 84, 129, 130, 154, 156, 266, 271
talking baby robot, 207
talking/replying, 170
target, 32, 33, 175, 177, 209, 218, 226, 227, 229, 230, 240, 243
tawny eagle, 4
tearing, 33, 34
teeth, 34, 116, 124, 138, 149, 150
terrestrial, 4, 5, 14, 15, 16, 17, 18, 19, 22, 26
tertiary, 81
thalamus, 44, 162, 169, 174, 269, 274
that-path, 8, 9
thematic, 239, 241, 246, 247, 249, 251, 254, 282
thematic role, 241, 246, 247, 249, 251, 254
theoretical, 116, 160, 182, 216, 217, 257, 262, 264, 282, 283
theory, 3, 4, 6, 11, 44, 50, 59, 60, 64, 73, 82, 87, 88, 92, 103, 113, 114, 115, 116, 120, 129, 130, 133, 136, 137, 138, 139, 142, 143, 145, 146, 147, 149, 152, 153, 155, 157, 161, 170, 171, 175, 176, 205, 213, 234, 236, 237, 240, 256, 258, 262, 273, 275, 280, 281, 282
theory of mind, 73, 82, 161, 170, 171, 175, 176, 273, 282
theta-role, 272, 274, 276
thumb, 49, 188
tip, 94, 183, 211, 214
TMS [transcranial magnetic stimulation], 40
toddler, 57
tomography, 42, 43, 167, 278
tongue, 8, 35, 87, 90, 91, 94, 99, 101, 102, 105, 113, 116, 124, 136, 139, 140, 141, 144, 149, 207, 209, 211, 213, 214, 216, 221, 223, 225, 226, 227, 237
tongue body, 87, 99, 101, 211, 221, 225, 226
tonic, 207, 212
tool, 78, 82, 84, 85, 109, 115, 125
tool making, 109
tool use, 78, 82, 84, 85, 115
topic, 182, 241, 249
track, 2, 3, 4, 5, 153, 207, 274
tracking, 212, 253, 256
trained, 45, 53, 58, 68, 71, 73, 76, 78, 79, 80, 191
transcranial magnetic stimulation [TMS], 167
transformation, 8, 18, 38, 45, 102, 120, 248, 256
transition, 10, 34, 36, 42, 73, 78, 108, 109, 116, 117, 119, 124, 126, 138, 149, 157, 166, 179, 183, 184, 189, 204, 205
transitional, 4, 78, 144
triadic, 75
trigger, 136, 138, 152
triune, 162, 168, 176
tube, 90, 91, 94, 103, 133, 224, 236
tufted capuchin monkey, 27
type, 3, 4, 13, 14, 15, 16, 17, 18, 22, 24, 26, 28, 30, 32, 34, 40, 114, 134, 142, 147, 160, 165, 173, 176, 184, 190, 193, 194, 196, 198, 242, 244, 245, 248, 252, 253, 259, 262, 269, 278

U
UG [universal grammar], 240
ultrasound, 2
unaware, 266, 273
understanding, 29, 33, 34, 38, 40, 41, 42, 43, 44, 45, 56, 64, 67, 69, 74, 78, 85, 101, 127, 130, 147, 148, 153, 155, 156, 157, 160, 161, 175, 182, 190, 254, 258, 277
unitary utterance, 110
universal, 86, 114, 133, 134, 139, 179, 188, 233, 240, 270, 272, 273
universal grammar [UG], 240
universals, 156, 235
unreachable, 72, 75, 77, 78, 80, 81
unstressed, 257
urgency, 11, 13, 14, 15, 16, 17, 18, 19, 22, 24, 26, 28
urine, 4
usage-based, 11, 241, 253, 258
utterance, 4, 108, 110, 140, 156, 180, 185, 190, 191, 192, 196, 198, 200, 254, 266, 267, 269, 270, 271, 272, 274, 278

V
V [Verb], 242
variety, 15, 58, 59, 78, 109, 157, 167, 180, 183, 191, 254, 255, 275
vehiculated, 184
velar, 93
velocity, 13, 214, 253
velum, 113, 136
ventral, 32, 36, 38, 40, 43, 123, 130, 149, 154, 164, 165, 167, 169, 174
ventrolateral, 169
verb, 6, 112, 241, 244, 250, 251, 252, 256, 257, 258, 259, 266, 271, 272, 274, 275, 276
verbal, 3, 8, 10, 11, 72, 73, 76, 151, 170, 242, 265, 271, 281, 282
versatility, 144
vervet monkey, 2, 14, 27, 30, 43, 58
vigilance, 15
visceral, 165
visceromotor, 167
viscerosensory, 167
visual, 1, 13, 15, 18, 24, 26, 28, 33, 34, 44, 53, 56, 58, 67, 71, 75, 81, 83, 84, 112, 115, 123, 133, 136, 147, 153, 207, 210, 213, 216, 226, 229, 230, 244, 253, 256, 258
visual communication, 1
vocabulary, 41, 112, 122, 179, 205, 256, 257
vocal, 3, 4, 7, 9, 28, 30, 31, 38, 41, 43, 44, 48, 49, 52, 54, 56, 59, 60, 62, 63, 64, 66, 70, 73, 81, 84, 87, 88, 90, 93, 94, 96, 97, 101, 102, 103, 104, 105, 106, 107, 110, 112, 113, 116, 117, 118, 121, 122, 123, 124, 125, 126, 127, 128, 130, 131, 133, 134, 136, 137, 138, 142, 144, 145, 147, 148, 149, 150, 152, 153, 157, 161, 168, 172, 176, 179, 180, 181, 182, 185, 186, 188, 189, 190, 191, 193, 194, 195, 196, 198, 199, 202, 203, 205, 207, 208, 209, 210, 211, 212, 216, 219, 222, 225, 226, 231, 233, 234, 235, 236, 237, 268, 271, 274
vocal action, 212
vocal animals, 60
vocal apparatus, 112, 118, 121, 122, 123, 125, 127, 134, 208, 235
vocal articulation, 116, 117
vocal articulators, 113
vocal asymmetry, 49
vocal behavior, 64
vocal calls, 30
vocal capacities, 137
vocal change, 136
vocal channel, 137, 182
vocal communication, 30, 31, 43, 44, 60, 70, 84, 116, 122, 136, 137, 152, 172
vocal communicative system, 30, 41
vocal complexity, 137
vocal component, 274
vocal control, 44, 49, 130, 168, 176
vocal counterparts, 198, 199
vocal cue, 7
vocal deixis, 131, 179, 194, 195, 198
vocal domain, 112, 127, 148
vocal element, 128, 161, 169, 182, 189, 190, 191, 199, 271
vocal emotion, 169
vocal exploration, 207, 219, 225
vocal expressions, 52, 63, 66
vocal fold, 90, 113, 210
vocal forms, 136, 142
vocal-gestural, 41
vocal gestures, 107, 110, 126, 128
vocal grooming, 138
vocal imitation, 207, 208, 209, 219, 225, 231, 237
vocal intensity, 104
vocal machinery, 124
vocal means, 189
vocal medium, 147, 152, 153
vocal modality, 179, 181, 193, 194, 203
vocal mode, 149
vocal only, 127, 190, 198, 199, 203
vocal operant conditioning, 168
vocal origin, 147, 150
vocal patterns, 144
vocal precursor, 148
vocal production, 30, 44, 113, 179, 182, 186, 188, 189, 199, 205
vocal reaction, 117
vocal recognition, 63
vocal repertoire, 30, 126, 181
vocal resonator, 210
vocal sequence, 149
vocal signal, 30, 52, 54, 62, 81, 137, 161
vocal speech, 128, 161
vocal system, 28, 162
vocal tract, 87, 88, 90, 93, 94, 96, 97, 98, 101, 102, 103, 104, 105, 106, 113, 157, 207, 209, 210, 211, 216, 222, 226, 233, 234, 236
vocal tract contours, 94
vocal tract growth, 157, 210, 211
vocal tract length, 94, 97
vocal tract model, 93, 94
vocal tract morphology, 93, 236
vocal tract shape, 88, 90, 106, 207
vocal tract variable, 234
vocal utterance, 117, 168, 190, 191, 193, 198, 199
vocalization, 13, 14, 29, 30, 31, 38, 41, 42, 47, 54, 55, 59, 61, 65, 114, 115, 116, 117, 118, 121, 122, 124, 125, 127, 148, 150, 155, 159, 161, 167, 168, 170, 173, 174, 189, 192, 221, 229, 263, 264
vocalize, 7, 8, 209, 240
vocalize to localize, 1, 13, 240
vocally, 87
voice, 9, 63, 174, 225, 263, 269
voicing, 113, 212
voicing offset, 212
voicing onset, 212
voluntary, 4, 30, 32, 40, 48, 116, 117, 122, 123, 151, 152, 161, 162, 163, 166, 168, 269, 270
vowel, 8, 87, 88, 90, 91, 92, 94, 97, 102, 103, 104, 105, 114, 133, 134, 138, 139, 140, 141, 144, 169, 209, 219, 224, 225, 229, 232, 233, 234, 235, 236, 237, 266
vowel acoustics, 87

W
Western, 67
width, 207, 213
wild, 2, 10, 16, 27, 42, 44, 65, 67, 68, 69, 74, 76, 77, 78, 80, 81, 84, 86, 262, 277
word, 3, 6, 8, 9, 10, 32, 61, 111, 113, 126, 139, 141, 142, 143, 156, 165, 169, 175, 179, 180, 181, 183, 184, 185, 186, 191, 192, 193, 194, 203, 204, 205, 241, 242, 245, 246, 247, 251, 254, 263, 271, 272, 274, 280
working memory, 8, 11, 119, 276
In the series Benjamins Current Topics (BCT) the following titles have been published thus far or are scheduled for publication:

19 Sekine, Satoshi and Elisabete Ranchhod (eds.): Named Entities. Recognition, classification and use. v, 164 pp. + index. Expected August 2009
18 Moon, Rosamund (ed.): Words, Grammar, Text. Revisiting the work of John Sinclair. viii, 122 pp. + index. Expected July 2009
17 Flowerdew, John and Michaela Mahlberg (eds.): Lexical Cohesion and Corpus Linguistics. 2009. vi, 124 pp.
16 Dror, Itiel E. and Stevan Harnad (eds.): Cognition Distributed. How cognitive technology extends our minds. 2008. xiii, 258 pp.
15 Stekeler-Weithofer, Pirmin (ed.): The Pragmatics of Making it Explicit. 2008. viii, 237 pp.
14 Baker, Anne and Bencie Woll (eds.): Sign Language Acquisition. 2009. xi, 167 pp.
13 Abry, Christian, Anne Vilain and Jean-Luc Schwartz (eds.): Vocalize to Localize. 2009. x, 311 pp.
12 Dror, Itiel E. (ed.): Cognitive Technologies and the Pragmatics of Cognition. 2007. xii, 186 pp.
11 Payne, Thomas E. and David J. Weber (eds.): Perspectives on Grammar Writing. 2007. viii, 218 pp.
10 Liebal, Katja, Cornelia Müller and Simone Pika (eds.): Gestural Communication in Nonhuman and Human Primates. 2007. xiv, 284 pp.
9 Pöchhacker, Franz and Miriam Shlesinger (eds.): Healthcare Interpreting. Discourse and Interaction. 2007. viii, 155 pp.
8 Teubert, Wolfgang (ed.): Text Corpora and Multilingual Lexicography. 2007. x, 162 pp.
7 Penke, Martina and Anette Rosenbach (eds.): What Counts as Evidence in Linguistics. The case of innateness. 2007. x, 297 pp.
6 Bamberg, Michael (ed.): Narrative – State of the Art. 2007. vi, 271 pp.
5 Anthonissen, Christine and Jan Blommaert (eds.): Discourse and Human Rights Violations. 2007. x, 142 pp.
4 Hauf, Petra and Friedrich Försterling (eds.): Making Minds. The shaping of human minds through social context. 2007. ix, 275 pp.
3 Chouliaraki, Lilie (ed.): The Soft Power of War. 2007. x, 148 pp.
2 Ibekwe-SanJuan, Fidelia, Anne Condamines and M. Teresa Cabré Castellví (eds.): Application-Driven Terminology Engineering. 2007. vii, 203 pp.
1 Nevalainen, Terttu and Sanna-Kaisa Tanskanen (eds.): Letter Writing. 2007. viii, 160 pp.