Advances in Natural Multimodal Dialogue Systems
Text, Speech and Language Technology VOLUME 30
Series Editors
Nancy Ide, Vassar College, New York
Jean Véronis, Université de Provence and CNRS, France

Editorial Board
Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands
Kenneth W. Church, AT&T Bell Labs, New Jersey, USA
Judith Klavans, Columbia University, New York, USA
David T. Barnard, University of Regina, Canada
Dan Tufis, Romanian Academy of Sciences, Romania
Joaquim Llisterri, Universitat Autònoma de Barcelona, Spain
Stig Johansson, University of Oslo, Norway
Joseph Mariani, LIMSI-CNRS, France
The titles published in this series are listed at the end of this volume.
Advances in Natural Multimodal Dialogue Systems

Edited by
Jan C.J. van Kuppevelt Waalre, The Netherlands
Laila Dybkjær University of Southern Denmark, Odense, Denmark
and
Niels Ole Bernsen University of Southern Denmark, Odense, Denmark
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN-10: 1-4020-3934-4 (PB)
ISBN-13: 978-1-4020-3934-8 (PB)
ISBN-10: 1-4020-3932-8 (HB)
ISBN-13: 978-1-4020-3932-4 (HB)
ISBN-10: 1-4020-3933-6 (e-book)
ISBN-13: 978-1-4020-3933-1 (e-book)
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. www.springer.com
Printed on acid-free paper
All Rights Reserved © 2005 Springer No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Printed in the Netherlands
Contents

Preface

1 Natural and Multimodal Interactivity Engineering - Directions and Needs
Niels Ole Bernsen and Laila Dybkjær
1. Introduction
2. Chapter Presentations
3. NMIE Contributions by the Included Chapters
4. Multimodality and Natural Interactivity
References

Part I Making Dialogues More Natural: Empirical Work and Applied Theory

2 Social Dialogue with Embodied Conversational Agents
Timothy Bickmore and Justine Cassell
1. Introduction
2. Embodied Conversational Agents
3. Social Dialogue
4. Related Work
5. Social Dialogue in REA
6. A Study Comparing ECA Social Dialogue with Audio-Only Social Dialogue
7. Conclusion
References

3 A First Experiment in Engagement for Human-Robot Interaction in Hosting Activities
Candace L. Sidner and Myroslava Dzikovska
1. Introduction
2. Hosting Activities
3. What is Engagement?
4. First Experiment in Hosting: A Pointing Robot
5. Making Progress on Hosting Behaviours
6. Engagement for Human-Human Interaction
7. Computational Modelling of Human-Human Hosting and Engagement
8. A Next Generation Mel
9. Summary
References

Part II Annotation and Analysis of Multimodal Data: Speech and Gesture

4 FORM
Craig H. Martell
1. Introduction
2. Structure of FORM
3. Annotation Graphs
4. Annotation Example
5. Preliminary Inter-Annotator Agreement Results
6. Conclusion: Applications to HLT and HCI?
Appendix: Other Tools, Schemes and Methods of Gesture Analysis
References

5 On the Relationships among Speech, Gestures, and Object Manipulation in Virtual Environments: Initial Evidence
Andrea Corradini and Philip R. Cohen
1. Introduction
2. Study
3. Data Analysis
4. Results
5. Discussion
6. Related Work
7. Future Work
8. Conclusions
Appendix: Questionnaire MYST III - EXILE
References

6 Analysing Multimodal Communication
Patrick G. T. Healey, Marcus Colman and Mike Thirlwell
1. Introduction
2. Breakdown and Repair
3. Analysing Communicative Co-ordination
4. Discussion
References

7 Do Oral Messages Help Visual Search?
Noëlle Carbonell and Suzanne Kieffer
1. Context and Motivation
2. Methodology and Experimental Set-Up
3. Results: Presentation and Discussion
4. Conclusion
References

8 Geometric and Statistical Approaches to Audiovisual Segmentation
Trevor Darrell, John W. Fisher III, Kevin W. Wilson, and Michael R. Siracusa
1. Introduction
2. Related Work
3. Multimodal Multisensor Domain
4. Results
5. Single Multimodal Sensor Domain
6. Integration
References

Part III Animated Talking Heads and Evaluation

9 The Psychology and Technology of Talking Heads: Applications in Language Learning
Dominic W. Massaro
1. Introduction
2. Facial Animation and Visible Speech Synthesis
3. Speech Science
4. Language Learning
5. Research on the Educational Impact of Animated Tutors
6. Summary
References

10 Effective Interaction with Talking Animated Agents in Dialogue Systems
Björn Granström and David House
1. Introduction
2. The KTH Talking Head
3. Effectiveness in Intelligibility and Information Presentation
4. Effectiveness in Interaction
5. Experimental Applications
6. The Effective Agent as a Language Tutor
7. Experiments and 3D Recordings for the Expressive Agent
References

11 Controlling the Gaze of Conversational Agents
Dirk Heylen, Ivo van Es, Anton Nijholt and Betsy van Dijk
1. Introduction
2. Functions of Gaze
3. The Experiment
4. Discussion
5. Conclusion
References

Part IV Architectures and Technologies for Advanced and Adaptive Multimodal Dialogue Systems

12 MIND: A Context-Based Multimodal Interpretation Framework in Conversational Systems
Joyce Y. Chai, Shimei Pan and Michelle X. Zhou
1. Introduction
2. Related Work
3. MIND Overview
4. Example Scenario
5. Semantics-Based Representation
6. Context-Based Multimodal Interpretation
7. Discussion
References

13 A General Purpose Architecture for Intelligent Tutoring Systems
Brady Clark, Oliver Lemon, Alexander Gruenstein, Elizabeth Owen Bratt, John Fry, Stanley Peters, Heather Pon-Barry, Karl Schultz, Zack Thomsen-Gray and Pucktada Treeratpituk
1. Introduction
2. An Intelligent Tutoring System for Damage Control
3. An Architecture for Multimodal Dialogue Systems
4. Activity Models
5. Dialogue Management Architecture
6. Benefits of ACI for Intelligent Tutoring Systems
7. Conclusion
References

14 MIAMM – A Multimodal Dialogue System using Haptics
Norbert Reithinger, Dirk Fedeler, Ashwani Kumar, Christoph Lauer, Elsa Pecourt and Laurent Romary
1. Introduction
2. Haptic Interaction in a Multimodal Dialogue System
3. Visual Haptic Interaction – Concepts in MIAMM
4. Dialogue Management
5. The Multimodal Interface Language (MMIL)
6. Conclusion
References

15 Adaptive Human-Computer Dialogue
Sorin Dusan and James Flanagan
1. Introduction
2. Overview of Language Acquisition
3. Dialogue Systems
4. Language Knowledge Representation
5. Dialogue Adaptation
6. Experiments
7. Conclusion
References

16 Machine Learning Approaches to Human Dialogue Modelling
Yorick Wilks, Nick Webb, Andrea Setzer, Mark Hepple and Roberta Catizone
1. Introduction
2. Modality Independent Dialogue Management
3. Learning to Annotate Utterances
4. Future work: Data Driven Dialogue Discovery
5. Discussion
References

Index
Preface
The chapters in this book jointly contribute to what we shall call the field of natural and multimodal interactive systems engineering. This is not yet a well-established field of research and commercial development but, rather, an emerging one in all respects. It brings together, in a process that, arguably, was bound to happen, contributors from many different, and often far more established, fields of research and industrial development. To mention but a few, these include speech technology, computer graphics and computer vision. The field's rapid expansion seems driven by a shared vision of the potential of new interactive modalities of information representation and exchange for radically transforming the world of computer systems, networks, devices, applications, etc. from the GUI (graphical user interface) paradigm into something which will enable a far deeper and much more intuitive and natural integration of computer systems into people's work and lives. Jointly, the chapters present a broad and detailed picture of where natural and multimodal interactive systems engineering stands today.

The book is based on selected presentations made at the International Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems held in Copenhagen, Denmark, in 2002 and sponsored by the European CLASS project. CLASS was initiated at the request of the European Commission with the purpose of supporting and stimulating collaboration among Human Language Technology (HLT) projects as well as between HLT projects and relevant projects outside Europe. The purpose of the workshop was to bring together researchers from academia and industry to discuss innovative approaches and challenges in natural and multimodal interactive systems engineering.

The Copenhagen 2002 CLASS workshop was not just a very worthwhile event in an emerging field due to the general quality of the papers presented. It was also largely representative of the state of the art in the field. Given the increasing interest in natural interactivity and multimodality and the excellent quality of the work presented, it was felt to be timely to publish a book reflecting recent developments. Sixteen high-quality papers from the workshop were selected for publication. Content-wise, the chapters in this book illustrate most aspects of natural and multimodal interactive systems engineering: applicable theory, empirical work, data annotation and analysis, enabling technologies, advanced systems, re-usability of components and tools, evaluation, and future visions. The selected papers have all been reviewed, revised, extended, and improved after the workshop.

We are convinced that people who work in natural interactive and multimodal dialogue systems engineering – from graduate and Ph.D. students to experienced researchers and developers, and no matter which community they come from originally – may find this collection of papers interesting and useful to their own work.

We would like to express our sincere gratitude to all those who helped us in preparing this book. Especially we would like to thank all reviewers for their valuable and extensive comments and criticism which have helped improve the quality of the individual chapters as well as the entire book.

THE EDITORS
Chapter 1

NATURAL AND MULTIMODAL INTERACTIVITY ENGINEERING - DIRECTIONS AND NEEDS

Niels Ole Bernsen and Laila Dybkjær
Natural Interactive Systems Laboratory, University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark
{nob, laila}@nis.sdu.dk

Abstract
This introductory chapter discusses the field of natural and multimodal interactivity engineering and presents the following 15 chapters in this context. A brief presentation of each chapter is given, their contributions to specific natural and multimodal interactivity engineering needs are discussed, and the concepts of multimodality and natural interactivity are explained along with an overview of the modalities investigated in the 15 chapters.
Keywords: Natural and multimodal interactivity engineering.
1. Introduction
Chapters 2 through 16 of this book present original contributions to the emerging field of natural and multimodal interactivity engineering (henceforth NMIE). A prominent characteristic of NMIE is that the field is not yet an established field of research and commercial development but, rather, an emerging one in all respects, including applicable theory, experimental results, platforms and development environments, standards (guidelines, de facto standards, official standards), evaluation paradigms, coherence, ultimate scope, enabling technologies for software engineering, general topology of the field itself, "killer applications", etc. The NMIE field is vast and brings together practitioners from very many different, and often far more established, fields of research and industrial development, such as signal processing, speech technology, computer graphics, computer vision, human-computer interaction, virtual and augmented reality, non-speech sound, haptic devices, telecommunications, computer games, etc.
Table 1.1. Needs for progress in natural and multimodal interactivity engineering (NMIE). General needs are listed with their specific needs indented beneath them.

Understand issues, problems, solutions
  Applicable theory: for any aspect of NMIE.
  Empirical work and analysis: controlled experiments, behavioural studies, simulations, scenario studies, task analysis on roles of, and collaboration among, specific modalities to achieve various benefits.
  Annotation and analysis: new quality data resources, coding schemes, coding tools, and standards.
  Future visions: visions, roadmaps, etc., general and per sub-area.

Build systems
  Enabling technologies: new basic technologies needed.
  More advanced systems: new, more complex, versatile, and capable system aspects.
  Make it easy: re-usable platforms, components, toolkits, architectures, interface languages, standards, etc.

Evaluate
  Evaluate all aspects: of components, systems, technologies, processes, etc.
It may be noted that the fact that a field of research has been established over decades in its own right is fully compatible with many if not most of its practitioners being novices in NMIE. It follows that NMIE community formation is an ongoing challenge for all. Broadly speaking, the emergence of a new systems field, such as NMIE, requires an understanding of issues, problems and solutions, knowledge and skills for building (or developing) systems and enabling technologies, and evaluation of any aspect of the process and its results. In the particular case of NMIE, these goals or needs could be made more specific as shown in Table 1.1. Below, we start with a brief presentation of each of the following 15 chapters (Section 2). Taking Table 1.1 as a point of departure, Section 3 provides an overview of, and discusses, the individual chapters' contributions to specific NMIE needs. Section 4 explains multimodality and natural interactivity and discusses the modalities investigated in the chapters of this book.
2. Chapter Presentations
We have found it appropriate to structure the 15 chapters into four parts under four headlines related to how the chapters contribute to the specific NMIE needs listed in Table 1.1. Each chapter has a main emphasis on issues that contribute to NMIE needs captured by the headline of the part of the book to which it belongs. The division can of course not be a sharp one. Several chapters include discussions of issues that would make them fit into several parts of the book.

Part one focuses on making dialogues more natural and has its main emphasis on experimental work and the application of theory. Part two concerns annotation and analysis of multimodal data, in particular, the modalities of speech and gesture. Part three addresses animated talking heads and related evaluation issues. Part four covers issues in building advanced multimodal dialogue systems, including architectures and technologies.
2.1 Making Dialogues More Natural: Empirical Work and Applied Theory
Two chapters have been categorised under this headline. They both aim at making interaction with a virtual or physical agent more natural. Experimental work is central in both chapters and so is the application of theory.

The chapter by Bickmore and Cassell (Chapter 2) presents an empirical study of the social dialogue of an embodied conversational real-estate agent during interaction with users. The study is a comparative one. In one setting, users could see the agent while interacting with it. In the second setting, users could talk to the agent but not see it. The chapter presents several interesting findings on social dialogue and interaction with embodied conversational agents.

Sidner and Dzikovska (Chapter 3) present empirical results from human-robot interaction in hosting activities. Their focus is on engagement, i.e. on the establishment and maintenance of a connection between interlocutors and the ending of it when desired. The stationary robot can point, make beat gestures, and move its eyes while conducting dialogue and tutoring on the use of a gas turbine engine shown to the user on a screen. The authors also discuss human-human hosting and engagement. Pursuing and building on theory of human-human engagement and using input from experiments, their idea is to continue to add capabilities to the robot which will make it better at showing engagement.
2.2 Annotation and Analysis of Multimodal Data: Speech and Gesture
The five chapters in this part have a common emphasis on data analysis. While one chapter focuses on annotation of already collected data, the other four chapters all describe experiments with data collection. In all cases, the data are subsequently analysed in order to, e.g., provide new knowledge on conversation or show whether a hypothesis was true or not.

Martell (Chapter 4) presents the FORM annotation scheme and illustrates its use. FORM enables annotators to mark up the kinematic information of gestures in videos. Although designed for gesture markup, FORM is also intended to be extensible to markup of speech and other conversational information. The goal is to establish an extensible corpus of annotated videos which can be used for research in various aspects of conversational interaction. In an appendix, Martell provides a brief overview of other tools, schemes and methods for gesture analysis.

Corradini and Cohen (Chapter 5) report on a Wizard-of-Oz study which investigated how people use gesture and speech during interaction with a video game when they do not have access to standard input devices. The subjects' interaction with the game was recorded, transcribed, further coded, and analysed. The primary analysis of the data concerns the users' use of speech-only, gesture-only, and speech and gesture in combination. Moreover, subjective data were collected by asking subjects after the experiment about their modality preferences during interaction. Although subjects' answers and their actual behaviour in the experiment did not always match, the study indicates a preference for multimodal interaction.

The chapter by Healey et al. (Chapter 6) addresses the analysis of human-human interaction. The motivation is the lack of support for the design of systems for human-human interaction. They discuss two psycholinguistic approaches and propose a third approach based on the detection and resolution of communication problems. This approach is useful for measuring the effectiveness of human-human interaction across tasks and modalities. A coding protocol for identification of repair phenomena across different modalities is presented, followed by evaluation results from testing the protocol on a small corpus of repair sequences. The presented approach has the potential to help in judging the effectiveness of multimodal communication.

Carbonell and Kieffer (Chapter 7) report on an experimental study which investigated whether oral messages facilitate visual search tasks on a crowded display. Using the mouse, subjects were asked to search and select visual targets in complex scenes presented on the screen. Before the presentation of each scene, the subject would either see the target alone, receive an oral description of the target and spatial information about its position in the scene, or get a combination of the visual and oral target descriptions. Analysis of the data collected suggests that appropriate oral messages can improve search accuracy as well as selection time. However, both objectively and subjectively, multimodal messages were most effective.

Darrell et al. (Chapter 8) discuss the problem of knowing who is speaking during multi-speaker interaction with a computer. They present two methods based on geometric and statistical source separation approaches, respectively. These methods are used for audiovisual segmentation of multiple speakers and have been used in experiments. One setup in a conference room used several stereo cameras and a ceiling-mounted microphone array grid. In this case a
geometric method was used to identify the speaker. In a second setup, involving use of a handheld device or a kiosk, a single camera and a single omnidirectional microphone were used and a statistical method applied for speaker identification. Data analysis showed that each approach was valuable in the intended domain. However, the authors propose that a combination of the two methods would be of benefit and initial integration efforts are discussed.
2.3 Animated Talking Heads and Evaluation
Three chapters address animated talking heads. These chapters all include experimental data collection and data analysis, and present evaluations of various aspects of talking head technology and its usability.

The chapter by Massaro (Chapter 9) concerns computer-assisted speech and language tutors for deaf, hard-of-hearing, and autistic children. The author presents an animated talking head for visible speech synthesis. The skin of the head can be made transparent so that one can see the tongue and the palate. The technology has been used in a language training program in which the agent guides the user through a number of exercises in vocabulary and grammar. The aim is to improve speech articulation and develop linguistic and phonological awareness in the users. The reported experiments show positive learning results.

Granström and House (Chapter 10) describe work on animated talking heads, focusing on the increased intelligibility and efficiency provided by the addition of the talking face, which uses text-to-speech synthesis. Results from various studies are presented. The studies include intelligibility tests and perceptual evaluation experiments. Among other things, facial cues that convey feedback, turn-taking, and prosodic functions like prominence have been investigated. A number of applications of the talking head are described, including a language tutor.

Heylen et al. (Chapter 11) report on how different eye gaze behaviours of a cartoon-like talking face affect the interaction with users. Three versions of the talking face were included in an experiment in which users had to make concert reservations by interacting with the talking face through typed input. One version of the face was designed to approximate human-like gaze behaviour closely. In a second version gaze shifts were kept minimal, and in a third version gaze shifts were random. Evaluation of data from the experiment clearly showed a better performance of, and preference for, the human-like version.
2.4 Architectures and Technologies for Advanced and Adaptive Multimodal Dialogue Systems
The last part of this book comprises five chapters which all have a strong focus on aspects of developing advanced multimodal dialogue systems. Several of the chapters present architectures which may be reused across applications, while others emphasise learning and adaptation.

The chapter by Chai et al. (Chapter 12) addresses the difficult task of interpreting multimodal user input. The authors propose to use a fine-grained semantic model that characterises the meaning of user input and the overall conversation, and an integrated interpretation approach drawing on context knowledge, such as conversation histories and domain knowledge. These two approaches are discussed in detail and are included in the multimodal interpretation framework presented. The framework is illustrated by a real-estate application in which it has been integrated.

Clark et al. (Chapter 13) discuss the application of a general-purpose architecture in support of multimodal interaction with complex devices and applications. The architecture includes speech recognition, natural language understanding, text-to-speech synthesis, an architecture for conversational intelligence, and use of the Open Agent Architecture. The architecture takes advantage of reusability and has been deployed in a number of dialogue systems. The authors report on its deployment in an intelligent tutoring system for shipboard damage control. Details about the tutoring system and the architecture are presented.

Reithinger et al. (Chapter 14) present a multimodal system for access to multimedia databases on small handheld devices. Interaction in three languages is supported. Emphasis is on haptic interaction via active buttons combined with spoken input and visual and acoustic output. The overall architecture of the system is explained and so is the format for data exchange between modules. Also, the dialogue manager is described, including its architecture and multimodal fusion issues.

Dusan and Flanagan (Chapter 15) address the difficult issue of ensuring sufficient vocabulary coverage in a spoken dialogue system. To overcome the problem that there may always be a need for additional words or word forms, the authors propose a method for adapting the vocabulary of a spoken dialogue system at run-time. Adaptation is done by the user by adding new concepts to existing pre-programmed concept classes and by providing semantic information about the new concepts. Multiple input modalities are available for doing the adaptation. Positive results from preliminary experiments with the method are reported.

Wilks et al. (Chapter 16) discuss machine learning approaches to the modelling of human-computer interaction. They first describe a dialogue manager built for multimodal dialogue handling. The dialogue manager uses a set of stereotypical dialogue patterns, called Dialogue Action Frames, for representation. The authors then describe an analysis module which learns to assign dialogue acts and semantic contents from corpora. The idea is to enable automatic derivation of Dialogue Action Frames, so that the dialogue manager will be able to use Dialogue Action Frames that are automatically learned from corpora.
3. NMIE Contributions by the Included Chapters
Using the right-hand column entries of Table 1.1, Table 1.2 indicates how the 15 chapters in this book contribute to the NMIE field. A preliminary conclusion based on Table 1.2 is that, for an emerging field which is only beginning to be exploited commercially, the NMIE research being done world-wide today is already pushing the frontiers in many of the directions needed. In the following, we discuss the chapter contributions to each of the left-hand entries in Table 1.2.
3.1 Applicable Theory
It may be characteristic of the NMIE field at present that our sample of papers only includes a single contribution of a primarily theoretical nature, i.e. Healey et al., which applies a psycholinguistic model of dialogue to help identify a subset of communication problems in order to judge the effectiveness of multimodal communication. Human-machine communication problems, their nature and identification by human or machine, have recently begun to attract the attention of more than a few NMIE researchers, and it has become quite clear that we need far better understanding of miscommunication in natural and multimodal interaction than we have at present.

However, the relative absence of theoretical papers is not characteristic in the sense that the field does not make use of, or even need, applicable theory. On the contrary, a large number of chapters actually do apply existing theory in some form, ranging from empirical generalisations to full-fledged theory of many different kinds. For instance, Bickmore and Cassell test generalisations on the effects on communication of involving embodied conversational agents; Carbonell and Kieffer apply modality theory; Chai et al. apply theories of human-human dialogue to the development of a fine-grained, semantics-based multimodal dialogue interpretation framework; Massaro applies theories of human learning; and Sidner and Dzikovska draw on conversation and collaboration theory.
Table 1.2. NMIE needs addressed by the chapters in this book. Specific NMIE needs are listed with the contributions (and contributing chapter numbers) indented beneath them.

Applicable theory
  No new theory except 6, but plenty of applied theory, e.g. 2, 3.

Empirical work and analysis
  Effects on communication of animated conversational agents. 2, 10, 11
  Spoken input in support of visual search. 7
  Gesture and speech for video game playing. 5
  Multi-speaker speech recognition. 8
  Gaze behaviour for more likeable animated interface agents. 2, 11
  Audio-visual speech output. 9, 10
  Animated talking heads for language learning. 9, 10
  Tutoring. 3, 9, 10, 13
  Hosting robots. 3

Annotation and analysis
  Coding scheme for conversational interaction research. 4
  Standard for internal representation of NMIE data codings. 4

Future visions
  Many papers with visions concerning new challenges in their research.

Enabling technologies
  Interactive robotics: robots controlled multimodally, tutoring and hosting robots. 3
  Multi-speaker speech recognition. 8
  Audio-visual speech synthesis for talking heads. 9, 10
  Machine learning of language and dialogue act assignment. 15, 16

More advanced systems
  Multilinguality. 14
  Ubiquitous (mobile) application. 14
  On-line observation-based user modelling for adaptivity. 12, 14
  Complex natural interactive dialogue management. 12, 13, 14, 16
  Machine learning for more advanced dialogue systems. 15, 16

Make it easy
  Platform for natural interactivity. 6
  Re-usable components (many papers).
  Architectures for multimodal systems and dialogue management. 3, 12, 13, 14, 16
  Multimodal interface language. 14
  XML for data exchange. 10, 14

Evaluate
  Effects on communication of animated conversational agents. 2, 10, 11
  Evaluations of talking heads. 9, 10, 11
  Evaluation of audio-visual speech synthesis for learning. 9, 10

3.2 Empirical Work and Analysis
Novel theory tends to be preceded by empirical exploration and generalisation. The NMIE field is replete with empirical studies of human-human and human-computer natural and multimodal interaction [Dehn and van Mulken,
2000]. By their nature, empirical studies are closer to the process of engineering than is theory development. We build NMIE research systems not only from theory but, perhaps to a far greater extent, from hunches, contextual assumptions, extrapolations from previous experience and untried transfer from different application scenarios, user groups, environments, etc., or even Wizard of Oz studies, which are in themselves a form of empirical study, see, e.g., the chapter by Corradini and Cohen and [Bernsen et al., 1998]. Having built a prototype system, we are keen to find out how far those hunches, etc. got us. Since empirical testing, evaluation, and assessment are integral parts of software and systems engineering, all we have to do is to include "assumptions testing" in the empirical evaluation of the implemented system which we would be doing anyway.

The drawback of empirical studies is that they usually do not generalise much due to the multitude of independent variables involved. This point is comprehensively argued and illustrated for the general case of multimodal and natural interactive systems which include speech in [Bernsen, 2002]. Still, as we tend to work on the basis of only slightly fortified hunches anyway, the results could often serve to inspire fellow researchers to follow them up. Thus, best-practice empirical studies are of major importance in guiding NMIE progress.

The empirical chapters in this book illustrate well the points made above. One cluster of findings demonstrates the potential of audio-visual speech output by animated talking heads for child language learning (Massaro) and, more generally, for improving intelligibility and efficiency of human-machine communication, including the substitution of facial animation for the still-missing prosody in current speech synthesis systems (Granström and House). In counter-point, so to speak, Darrell et al. convincingly demonstrate the advantage of audio-visual input for tackling an important next step in speech technology, i.e. the recognition of multi-speaker spoken input. Jointly, the three chapters do a magnificent job of justifying the need for natural and multimodal (audio-visual) interaction independently of any psychological or social-psychological argument in favour of employing animated conversational agents.

A key question seems to be: for which purpose(s), other than harvesting the benefits of using audio-visual speech input/output described above, do we need to accompany spoken human-computer dialogue with more or less elaborate animated conversational interface agents [Dehn and van Mulken, 2000]? By contrast with spoken output, animated interface agents occupy valuable screen real estate, do not necessarily add information of importance to the users of large classes of applications, and may distract the user from the task at hand. Whilst a concise and comprehensive answer to this question is still pending, Bickmore and Cassell go a long way towards explaining that the introduction of life-like animated interface agents into human-computer spoken dialogue is a tough and demanding proposition. As soon as an agent appears on the display, users tend to switch expectations from talking to a machine to talking to a human. By comparison, the finding of Heylen et al. that users tend to appreciate an animated cartoon agent more if it shows a minimum of human-like gaze behaviour might speak in favour of preferring cartoon-style agents over life-like animated agents, because the former do not run the risk of facing our full expectations of human conversational behaviour.

Sidner and Dzikovska do not involve a virtual agent but, rather, a robot in the dialogue with the user, so they do not have the problem of an agent occupying part of the screen. But they still have the behavioural problems of the robot to look into, just as, by close analogy, do the people who work with virtual agents. The experiments by Sidner and Dzikovska show that there is still a long way to go before we fully understand and can model the subtle details of human behaviour in dialogue.

On the multimodal side of the natural interactivity/multimodality semi-divide, several papers address issues of modality collaboration, i.e., how the use of modality combinations could facilitate, or even enable, human-computer interaction tasks that could not be done easily, if at all, using unimodal interaction. Carbonell and Kieffer report on how combined speech and graphics output can facilitate display search, and Corradini and Cohen show how the optional use of different input modalities can improve interaction in a particular virtual environment.
3.3 Annotation and Analysis
It is perhaps not surprising that we are not very capable of predicting what people will do, or how they will behave, when interacting with computer systems using new modality combinations and possibly also new interactive devices. More surprising, however, is the fact that we are often just as ignorant when trying to predict natural interactive behaviours which we have the opportunity to observe every day in ourselves and others, such as: which kinds of gestures, if any, do people perform when they are listening to someone else speaking? This example illustrates that, to understand the ways in which people communicate with one another as well as the ways in which people communicate with the far more limited, current NMI systems, we need extensive studies of behavioural data. The study of data on natural and multimodal interaction is becoming a major research area full of potential for new discoveries. A number of chapters make use of, or refer to, data resources for NMIE, but none of them take a more general view on data resource issues. One chapter addresses NMIE needs for new coding schemes. Martell presents a kinematically-based gesture annotation scheme for capturing the kinematic information in gestures from videos of speakers. Linking the urgent issue of
new, more powerful coding tools with the equally important issue of standardisation, Martell proposes a standard for the internal representation of NMIE codings.
3.4 Future Visions
None of the chapters have a particular focus on future visions for the NMIE field. However, many authors touch on future visions, e.g., in their descriptions of future work and what they would like to achieve. This includes the important driving role of re-usable platforms, tools, and components for making rapid progress. Moreover, there are several hints at the future importance to NMIE of two generic enabling technologies which are needed to extend spoken dialogue systems to full natural interactive systems. These technologies are (i) computer vision for processing camera input, and (ii) computer animation systems. It is only recently that the computer vision community has begun to address issues of natural interactive and multimodal human-system communication, and there is a long way to go before computer vision can parallel speech recognition as a major input medium for NMIE.

The chapters by Massaro and by Granström and House illustrate current NMIE efforts to extend natural and multimodal interaction beyond traditional information systems to new major application areas, such as training and education, which has been around for a while already, notably in the US-dominated paradigm of tutoring systems using animated interface agents, but also to edutainment and entertainment. While the GUI, including the current WWW, might be said to have the edutainment potential of a schoolbook or newspaper, NMIE systems have the much more powerful edutainment potential of brilliant teachers, comedians, and exciting human-human games.
3.5 Enabling Technologies
An enabling technology is often developed over a long time by some separate community, such as by the speech recognition community from the 1950s to the late 1980s. Having matured to the point at which practical applications become possible, the technology transforms into an omnipresent tool for system developers, as is the case with speech recognition technology today. NMIE needs a large number of enabling technologies, and these are currently at very different stages of maturity. Several enabling technologies, some of which are at an early stage and some of which are finding their way into applications, are presented in this book in the context of their application to NMIE problems, including robot interaction and agent technology, multi-speaker interaction and recognition, machine learning, and talking face technology.

Sidner and Dzikovska focus on robot interaction in the general domain of "hosting", i.e., where a virtual or physical agent provides guidance, education, or entertainment based on collaborative goals negotiation and subsequent action. A great deal of work remains to be done before robot interaction becomes natural in any approximate sense of the term. For instance, the robot's spoken dialogue capabilities must be strongly improved and so must its embodied appearance and global communicative behaviours. In fact, Sidner and Dzikovska draw some of the same conclusions as Bickmore and Cassell, namely that agents need to become far more human-like in all or most respects before they are really appreciated by humans.

Darrell et al. address the problem in multi-speaker interaction of knowing who is addressing the computer when. Their approach is to use a combination of microphones and computer vision to find out who is talking.

Developers of spoken dialogue applications must cope with problems resulting from vocabulary and grammar limitations and from difficulties in enabling much of the flexibility and functionality inherent in human-human communication. Despite having carried out systematic testing, the developer often finds that words are missing when a new user addresses the application. Dusan and Flanagan propose machine learning as a way to overcome part of this problem. Using machine learning, the system can learn new words and grammars taught to it by the user in a well-defined way. Wilks et al. address machine learning - or transformation-based learning - in the context of assigning dialogue acts as part of an approach to improved dialogue modelling. In another part of their approach, Wilks et al. consider the use of dialogue action frames, i.e., a set of stereotypical dialogue patterns which perhaps may be learned from corpus data, as a means for flexibly switching back and forth between topics during dialogue.

Granström and House and Massaro describe the gain in intelligibility that can be obtained by combining speech synthesis with a talking face. There is still much work to do both on synthesis and on face articulation. For most languages, speech synthesis is still not very natural to listen to, and if one wants to develop a particular voice to fit a certain animated character, this is not immediately possible with today's technology. With respect to face articulation, faces need to become much more natural in terms of, e.g., gaze, eyebrow movements, lip and mouth movements, and head movements, as this seems to influence users' perception of the interaction, cf. the chapters by Granström and House and Heylen et al.
3.6 More Advanced Systems
Enabling technologies for NMIE are often component technologies, and their description, including state of the art, current research challenges, and unsolved problems, can normally be made in a relatively systematic and focused manner. It is far more difficult to systematically describe the complexity
of the constant push in research and industry towards exploring and exploiting new NMIE application types and new application domains, addressing new user populations, increasing the capabilities of systems in familiar domains of application, exploring known technologies with new kinds of devices, etc. In general, the picture is one of pushing present boundaries in all or most directions. During the past few years, a core trend in NMIE has been to combine different modalities in order to build more complex, versatile and capable systems, getting closer to natural interactivity than is possible with only a single modality. This trend is reflected in several chapters. Part of the NMIE paradigm is that systems must be available whenever and wherever convenient and useful, making ubiquitous computing an important application domain. Mobile devices, such as mobile phones, PDAs, and portable computers of any (portable) size have become popular and are rapidly gaining functionality. However, the interface and interaction affordances of small devices require careful consideration. Reithinger et al. present some of those considerations in the context of providing access to large amounts of data about music. It can be difficult for users to know how to interact with new NMIE applications. Although not always very successful in practice, the classical GUI system has the opportunity to present its affordances in static graphics (including text) before the user chooses how to interact. A speech-only system, by contrast, cannot do that because of the dynamic and transitory nature of acoustic modalities. NMIE systems, in other words, pose radically new demands on how to support the user prior to, and during, interaction. Addressing this problem, several chapters mention user modelling or repositories of user preferences built on the basis of interactions with a system, cf. the chapters by Chai et al. and Reithinger et al. Machine learning, although another example of less-than-expected pace of development during the past 10 years, has great potential for increasing interaction support. In an advanced application of machine learning, Dusan and Flanagan propose to increase the system’s vocabulary and grammar by letting users teach the system new words and their meaning and use. Wilks et al. use machine learning as part of an approach to more advanced dialogue modelling. Increasingly advanced systems require increasingly complex dialogue management, cf. the chapters by Chai et al., Clark et al., and Wilks et al. Lifelikeness of animated interface agents and conversational dialogue are among the key challenges in achieving the NMIE vision. Multilinguality of systems is an important NMIE goal. Multilingual applications are addressed by Reithinger et al. In their case, the application is running on a handheld device. Multi-speaker input speech is mentioned by Darrell et al. For good reason, recognition of multi-speaker input has become a lively research topic. We
need solutions in order to, e.g., build meeting minute-takers, separate the focal speaker’s input from that of other speakers, exploit the huge potential of spoken multi-user applications, etc.
3.7 Make It Easy
Due to the complexity of multimodal natural interaction, it is becoming dramatically important to be able to build systems as easily as possible. It seems likely that no single research lab or development team in industry, even including giants such as Microsoft, is able to master all of the enabling technologies required for broad-scale NMIE progress. To advance efficiently, everybody needs access to those system components, and their built-in know-how, which are not in development focus. This implies strong attention to issues such as re-usable platforms, components and architectures, development toolkits, interface languages, data formats, and standardisation.

Clark et al. have used the Open Agent Architecture (OAA, http://www.ai.sri.com/~oaa/), a framework for integrating heterogeneous software agents in a distributed environment. What OAA and other architectural frameworks, such as CORBA (http://www.corba.org/), aim to do is provide a means for modularisation, synchronous and asynchronous communication, well-defined inter-module communication via some interface language, such as IDL (CORBA) or ICL (OAA), and the possibility of implementation in a distributed environment. XML (Extensible Markup Language) is a simple, flexible text format derived from SGML (ISO 8879) which has become popular as, among other things, a message exchange format, cf. Reithinger et al. and Granström and House. Using XML for wrapping inter-module messages is one way to overcome the problem of different programming languages being used for implementing different modules; a minimal sketch of such a wrapped message is given at the end of this section.

Some chapters express a need for reusable components. Many of the applications described include off-the-shelf software, including components developed in other projects. This is particularly true for mature enabling technologies, such as speech recognition and synthesis components. As regards multimodal dialogue management, there is an expressed need for reuse in, e.g., the chapter by Clark et al., who discuss a reusable dialogue management architecture in support of multimodal interaction.

In conclusion, there are architectures, platforms, and software components available which facilitate the building of new NMIE applications, and standards are underway for certain aspects. There is still much work to be done on standardisation, new and better platforms, and improvement of component software. In addition, we need, in particular, more and better toolkits in support of system development and a better understanding of those components which cannot be bought off-the-shelf and which are typically difficult to reuse, such as dialogue managers. Advancements such as these are likely to require significant corpus work. Corpora with tools and annotation schemes as described by Martell are exactly what is needed in this context.
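As a rough illustration of the XML message-wrapping idea mentioned above, the sketch below builds and parses one inter-module message using Python's standard library. All element names, attribute names, and module identifiers (e.g. "speech_recognizer", "dialogue_manager", "hypothesis", "confidence") are invented for this example and do not reproduce any particular project's interface language, such as MIAMM's MMIL or OAA's ICL.

```python
# Illustrative sketch only: wraps a recognizer hypothesis in a small XML
# message so that modules written in different languages can exchange it.
# All element and attribute names here are invented for the example.
import xml.etree.ElementTree as ET


def wrap_message(sender: str, receiver: str, utterance: str, confidence: float) -> str:
    """Serialise one inter-module message as an XML string."""
    msg = ET.Element("message", attrib={"from": sender, "to": receiver})
    hyp = ET.SubElement(msg, "hypothesis", attrib={"confidence": f"{confidence:.2f}"})
    hyp.text = utterance
    return ET.tostring(msg, encoding="unicode")


def unwrap_message(xml_string: str) -> dict:
    """Parse the XML back into a plain dictionary for the receiving module."""
    msg = ET.fromstring(xml_string)
    hyp = msg.find("hypothesis")
    return {
        "from": msg.get("from"),
        "to": msg.get("to"),
        "utterance": hyp.text,
        "confidence": float(hyp.get("confidence")),
    }


if __name__ == "__main__":
    wire = wrap_message("speech_recognizer", "dialogue_manager",
                        "show me apartments near the harbour", 0.87)
    print(wire)
    print(unwrap_message(wire))
```

Because the message travels as plain text, the receiving module only needs an XML parser in its own implementation language, which is the language-independence benefit referred to above.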
3.8 Evaluate
Software systems and components evaluation is a broad area, ranging from technical evaluation through usability evaluation to customer evaluation. Customer evaluation has never been a key issue in research but has, rather, tended to be left to the marketing departments of companies. Technical evaluation and usability evaluation, including evaluation of functionality from both perspectives, are, on the other hand, considered important research issues. The chapters show a clear trend towards focusing on usability evaluation and comparative performance evaluation.

It is hardly surprising that performance evaluation and usability issues are considered key topics today. We know little about what happens when we move towards increasingly multimodal and natural interactive systems, both as regards how these new systems will perform compared to alternative solutions and how the systems will be received and perceived by their users. We only know that a technically optimal system is not sufficient to guarantee user satisfaction.

Comparative performance evaluation objectively compares users' performance on different systems with respect to, e.g., how well they understand speech-only versus speech combined with a talking face or with an embodied animated agent, cf. Granström and House, Massaro, and Bickmore and Cassell. The usability issues evaluated all relate to users' perception of a particular system and include parameters such as life-likeness, credibility, reliability, efficiency, personality, ease of use, and understanding quality, cf. Heylen et al. and Bickmore and Cassell.

Two chapters address how the intelligibility of what is being said can be increased through visual articulation, cf. Granström and House and Massaro. Granström and House have used a talking head in several applications, including tourist information, real estate (apartment) search, aid for the hearing impaired, education, and infotainment. Evaluation shows a significant gain in intelligibility for the hearing impaired. Eyebrow and head movement enhance perception of emphasis and syllable prominence. Over-articulation may be useful as well when there are special needs for intelligibility. The findings of Massaro support these promising conclusions. His focus is on applications for the hard-of-hearing, children with autism, and child language learning more generally.

Granström and House also address the increase in efficiency of communication/interaction produced by using an animated talking head. Probably, naturalness is a key point here. This is suggested by Heylen et al., who conducted controlled experiments on the effects of different eye gaze behaviours of a cartoon-like talking face on the quality of human-agent dialogues. The most human-like agent gaze behaviour led to higher appreciation of the agent and more efficient task performance.

Bickmore and Cassell evaluate the effects on communication of an embodied conversational real-estate agent versus an over-the-phone version of the same system, cf. also [Cassell et al., 2000]. In each condition, two variations of the system were available. One was fully task-oriented while the second version included some small-talk options. In general, users liked the system better in the phone condition. In the phone condition, subjects appreciated the small-talk whereas, in the embodied condition, subjects wanted to get down to business. The implication is that agent embodiment has strong effects on the interlocutors. Users tend to compare their animated interlocutors with humans rather than machines. To work with users, animated agents need considerably more naturalness and personally attractive features communicated non-verbally. This imposes a demanding research agenda on both speech and non-verbal output, requiring conversational abilities both verbally and non-verbally.

Jointly, the chapters on evaluation demonstrate a broad need for performance evaluation, comparative as well as non-comparative, that can inform us on the possible benefits and shortcomings of new natural interactive and multimodal systems. The chapters show a similar need for usability evaluation that can help us find out how users perceive these new systems, and a need for finding ways in which usability and user satisfaction might be correlated with technical aspects in order for the former to be derived from the latter.
4. Multimodality and Natural Interactivity
Conceptually, NMIE combines natural interactive and multimodal systems and components engineering. While both concepts, natural interactivity and multimodality, have a long history, it would seem that they continue to sit somewhat uneasily side by side in the minds of most of us.

Multimodality is the idea of being able to choose any input/output modality or combination of input/output modalities for optimising interaction with the application at hand, such as speech input for many heads-up, hands-occupied applications, speech and haptic input/output for applications for the blind, etc. A modality is a particular way of representing input or output information in some physical medium, such as something touchable, light, sound, or the chemistry for producing olfaction and gustation [Bernsen, 2002], see also the chapter by Carbonell and Kieffer. The physical medium of the speech modalities, for instance, is sound or acoustics, but this medium obviously enables the transmission of information in many acoustic modalities other than speech, such as earcons, music, etc. The term multimodality thus refers to any possible combination of elementary or unimodal modalities.

Compared to multimodality, the notion of natural interactivity appears to be the more focused of the two. This is because natural interactivity comes with a focused vision of the future of interaction with computer systems as well as a relatively well-defined set of modalities required for the vision to become reality. The natural interactivity vision is that of humans communicating with computer systems in the same ways in which humans communicate with one another. Thus, natural interactivity specifically emphasises human-system communication involving the following input/output modalities used in situated human-human communication: speech, gesture, gaze, facial expression, head and body posture, and object manipulation as an integral part of the communication (or dialogue). As the objects being manipulated may themselves represent information, such as text and graphics input/output objects, natural interaction subsumes the GUI paradigm. Technologically, the natural interactivity vision is being pursued vigorously by, among others, the emerging research community in talking faces and embodied conversational agents, as illustrated in the chapters by Bickmore and Cassell, Granström and House, Heylen et al., and Massaro. An embodied conversational agent may be either virtual or a robot, cf. the chapter by Sidner and Dzikovska.

A weakness in our current understanding of natural interactivity is that it is not quite clear where to draw the boundary between the natural interactivity modalities and all those other modalities and modality combinations which could potentially be of benefit to human-system interaction. For instance, isn't pushing a button, on the mouse or otherwise, although never used in human-human communication for the simple reason that humans do not have communicative buttons on them, as natural as speaking? If it is, then, perhaps, all or most research on useful multimodal input/output modality combinations is also research into natural interactivity, even if the modalities addressed are not being used in human-human communication. In addition to illustrating the need for more and better NMIE theory, the point just made may explain the uneasy conceptual relationship between the two paradigms of natural interactivity and multimodality. In any case, we have decided to combine the paradigms and address them together as natural and multimodal interactivity engineering.

Finally, by NMI 'engineering' we primarily refer to software engineering. It follows that the expression 'natural and multimodal interactivity engineering' primarily represents the idea of creating a specialised branch of software engineering for the field addressed in this book. It is important to add, however, that NMIE enabling technologies are being developed in fields whose practitioners do not tend to regard themselves as doing software engineering, such as signal processing. For instance, the recently launched European Network of
Excellence SIMILAR (http://www.similar.cc) addresses signal processing for natural and multimodal interaction.
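To make the modality/medium distinction concrete, the following minimal Python sketch (ours, purely illustrative; the class and value names are assumptions, not part of Bernsen's taxonomy) treats a modality as a way of representing information in a physical medium, and a multimodal combination as a set of such modalities.

```python
from dataclasses import dataclass
from enum import Enum


class Medium(Enum):
    """Physical media in which information can be represented."""
    GRAPHICS = "light"
    ACOUSTICS = "sound"
    HAPTICS = "touch"
    OLFACTION = "smell"
    GUSTATION = "taste"


@dataclass(frozen=True)
class Modality:
    """A particular way of representing input or output information in a medium."""
    name: str
    medium: Medium
    direction: str  # "input", "output", or "input/output"


# The acoustic medium carries several distinct unimodal modalities.
SPEECH = Modality("speech", Medium.ACOUSTICS, "input/output")
EARCON = Modality("earcon", Medium.ACOUSTICS, "output")
MUSIC = Modality("music", Medium.ACOUSTICS, "output")

# A multimodal combination is simply a set of unimodal modalities,
# e.g. speech plus haptics for an application for blind users.
SPEECH_PLUS_HAPTICS = {SPEECH, Modality("haptic display", Medium.HAPTICS, "input/output")}
```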
4.1 Modalities Investigated in This Book
We argued above that multimodality includes all possible modalities for the representation and exchange of information among humans and between humans and computer systems, and that natural interactivity includes a rather vaguely defined, large sub-set of those modalities. Within this wide space of unimodal modalities and modality combinations, it may be useful to look at the modalities actually addressed in the following chapters. These are summarised by chapter in Table 1.3.

Table 1.3. Modalities addressed in the included chapters (listed by chapter number plus first listed author).

Chapter | Input | Output
2. Bickmore | speech, gesture (via camera) vs. speech-only | embodied conversational agent + images vs. speech-only + images
3. Sidner | speech, mouse clicks (new version includes face and gesture input via camera) | robot pointing and beat gestures, speech, gaze
4. Martell | gesture | N/A
5. Corradini | speech, gesture, object manipulation/manipulative gesture | video game
6. Healey | N/A | N/A
7. Carbonell | gesture (mouse) | speech, graphics
8. Darrell | speech, camera-based graphics | N/A
9. Massaro | mouse, speech | audio-visual speech synthesis, talking head, images, text
10. Granström | N/A | audio-visual speech synthesis, talking head
11. Heylen | typed text | talking head, gaze
12. Chai | speech, text, gesture (pointing) | speech, graphics
13. Clark | speech, gesture (pointing) | speech, text, graphics
14. Reithinger | speech, haptic buttons | music, speech, text, graphics, tactile rhythm
15. Dusan | speech, keyboard, mouse, pen-based drawing and pointing, camera | speech, graphics, text display
16. Wilks | speech and possibly other modalities | speech and possibly other modalities (focus is on dialogue modelling, so input/output modalities are not discussed in detail)
Combined speech input/output, which in fact means spoken dialogue almost throughout, is addressed in about half of the chapters (Bickmore and Cassell, Chai et al., Clark et al., Corradini and Cohen, Dusan and Flanagan, Reithinger et al., and Sidner and Dzikovska). Almost two thirds of the chapters address gesture input in some form (Bickmore and Cassell, Chai et al., Clark et al., Corradini and Cohen, Darrell et al., Dusan and Flanagan, Martell, Reithinger et al., and Sidner and Dzikovska). Five chapters address output modalities involving talking heads, embodied animated agents, or robots (Bickmore and Cassell, Granström and House, Heylen et al., Massaro, and Sidner and Dzikovska). Three chapters (Darrell et al., Bickmore and Cassell, and Sidner and Dzikovska) address computer vision. Dusan and Flanagan also mention that their system has camera-based input. Facial expression of emotion is addressed by Granström and House. Despite its richness and key role in natural interactivity, input or output speech prosody is hardly discussed. Granström and House discuss graphical ways of replacing missing output speech prosody by facial expression means. In general, the input and output modalities and their combinations discussed would appear representative of the state-of-the-art in NMIE. The authors make quite clear how far we are from mastering the very large number of potentially useful "compounds" of unimodal modalities theoretically, in input recognition, in output generation, as well as in understanding and generation.
References

Bernsen, N. O. (2002). Multimodality in Language and Speech Systems - From Theory to Design Support Tool. In Granström, B., House, D., and Karlsson, I., editors, Multimodality in Language and Speech Systems, pages 93–148. Dordrecht: Kluwer Academic Publishers.

Bernsen, N. O., Dybkjær, H., and Dybkjær, L. (1998). Designing Interactive Speech Systems. From First Ideas to User Testing. Springer Verlag.

Cassell, J., Bickmore, T., Campbell, L., Vilhjálmsson, H., and Yan, H. (2000). Human Conversation as a System Framework: Designing Embodied Conversational Agents. In Embodied Conversational Agents, pages 29–63. Cambridge, MA: MIT Press.

Dehn, D. and van Mulken, S. (2000). The Impact of Animated Interface Agents: A Review of Empirical Research. International Journal of Human-Computer Studies, 52:1–22.
PART I
MAKING DIALOGUES MORE NATURAL: EMPIRICAL WORK AND APPLIED THEORY
Chapter 2
SOCIAL DIALOGUE WITH EMBODIED CONVERSATIONAL AGENTS
Timothy Bickmore
Northeastern University, USA
[email protected]
Justine Cassell
Northwestern University, USA
[email protected]
Abstract
The functions of social dialogue between people in the context of performing a task are discussed, as well as approaches to modelling such dialogue in embodied conversational agents. A study of an agent's use of social dialogue is presented, comparing embodied interactions with similar interactions conducted over the phone, and assessing the impact these media have on a wide range of behavioural, task and subjective measures. Results indicate that subjects' perceptions of the agent are sensitive to both interaction style (social vs. task-only dialogue) and medium.
Keywords:
Embodied conversational agent, social dialogue, trust.
1. Introduction
Human-human dialogue does not just comprise statements about the task at hand, about the joint and separate goals of the interlocutors, and about their plans. In human-human conversation participants often engage in talk that, on the surface, does not seem to move the dialogue forward at all. However, this talk – about the weather, current events, and many other topics without significant overt relationship to the task at hand – may, in fact, be essential to how humans obtain information about one another's goals and plans and decide whether collaborative work is worth engaging in at all. For example, realtors use small talk to gather information to form stereotypes (a collection
of frequently co-occurring characteristics) of their clients – people who drive minivans are more likely to have children, and therefore to be searching for larger homes in neighbourhoods with good schools. Realtors – and salespeople in general – also use small talk to increase intimacy with their clients, to establish their own expertise, and to manage how and when they present information to the client [Prus, 1989].

Nonverbal behaviour plays an especially important role in such social dialogue, as evidenced by the fact that most important business meetings are still conducted face-to-face rather than on the phone. This intuition is backed up by empirical research; several studies have found that the additional nonverbal cues provided by video-mediated communication do not affect performance in task-oriented interactions, but in interactions of a more social nature, such as getting acquainted or negotiation, video is superior [Whittaker and O'Conaill, 1997]. These studies have found that for social tasks, interactions were more personalized, less argumentative and more polite when conducted via video-mediated communication, that participants believed video-mediated (and face-to-face) communication was superior, and that groups conversing using video-mediated communication tended to like each other more, compared to audio-only interactions.

Together, these findings indicate that if we are to develop computer agents capable of performing as well as humans on tasks such as real estate sales then, in addition to task goals such as reliable and efficient information delivery, they must have the appropriate social competencies designed into them. Further, since these competencies include the use of nonverbal behaviour for conveying communicative and social cues, our agents must have the capability of producing and recognizing nonverbal cues in simulations of face-to-face interactions. We call agents with such capabilities “Embodied Conversational Agents” or “ECAs.”

The current chapter extends previous work which demonstrated that social dialogue can have a significant impact on a user's trust of an ECA [Bickmore and Cassell, 2001], by investigating whether these results hold in the absence of nonverbal cues. We present the results of a study designed to determine whether the psychological effects of social dialogue – namely to increase trust and associated positive evaluations – vary when the nonverbal cues provided by the embodied conversational agent are removed. In addition to varying medium (voice only vs. embodied) and dialogue style (social dialogue vs. task-only) we also assessed and examined effects due to the user's personality along the introversion/extroversion dimension, since extroversion is one indicator of a person's comfort level with face-to-face interaction.
2. Embodied Conversational Agents
Embodied conversational agents are animated anthropomorphic interface agents that are able to engage a user in real-time, multimodal dialogue, using speech, gesture, gaze, posture, intonation, and other verbal and nonverbal behaviours to emulate the experience of human face-to-face interaction [Cassell et al., 2000c]. The nonverbal channels are important not only for conveying information (redundantly or complementarily with respect to the speech channel), but also for regulating the flow of the conversation. The nonverbal channel is especially crucial for social dialogue, since it can be used to provide such social cues as attentiveness, positive affect, and liking and attraction, and to mark shifts into and out of social activities [Argyle, 1988].
2.1 Functions versus Behaviours
Embodiment provides the possibility for a wide range of behaviours that, when executed in tight synchronization with language, carry out a communicative function. It is important to understand that particular behaviours, such as the raising of the eyebrows, can be employed in a variety of circumstances to produce different communicative effects, and that the same communicative function may be realized through different sets of behaviours. It is therefore clear that any system dealing with conversational modelling has to handle function separately from surface-form or run the risk of being inflexible and insensitive to the natural phases of the conversation. Here we briefly describe some of the fundamental communication categories and their functional sub-parts, along with examples of nonverbal behaviour that contribute to their successful implementation. Table 2.1 shows examples of mappings from communicative function to particular behaviours and is based on previous research on typical North American nonverbal displays, mainly [Chovil, 1991; Duncan, 1974; Kendon, 1980].
Conversation initiation and termination Humans partake in an elaborate ritual when engaging and disengaging in conversations [Kendon, 1980]. For example, people will show their readiness to engage in a conversation by turning towards the potential interlocutors, gazing at them and then exchanging signs of mutual recognition typically involving a smile, eyebrow movement and tossing the head or waving of the arm. Following this initial synchronization stage, or distance salutation, the two people approach each other, sealing their commitment to the conversation through a close salutation such as a handshake accompanied by a ritualistic verbal exchange. The greeting phase ends when the two participants re-orient their bodies, moving away from a face-on orientation to stand at an angle. Terminating a conversation similarly moves through stages, starting with non-verbal cues, such as orientation shifts
Table 2.1. Some examples of conversational functions and their behaviour realization [Cassell et al., 2000b].

Communicative Functions | Communicative Behaviour
Initiation and termination:
  Reacting | Short Glance
  Inviting Contact | Sustained Glance, Smile
  Distance Salutation | Looking, Head Toss/Nod, Raise Eyebrows, Wave, Smile
  Close Salutation | Looking, Head Nod, Embrace or Handshake, Smile
  Break Away | Glance Around
  Farewell | Looking, Head Nod, Wave
Turn-Taking:
  Give Turn | Looking, Raise Eyebrows (followed by silence)
  Wanting Turn | Raise Hands into gesture space
  Take Turn | Glance Away, Start talking
Feedback:
  Request Feedback | Looking, Raise Eyebrows
  Give Feedback | Looking, Head Nod
or glances away, and culminating in the verbal exchange of farewells and the breaking of mutual gaze.
Conversational turn-taking and interruption Interlocutors do not normally talk at the same time, thus imposing a turn-taking sequence on the conversation. The protocols involved in floor management – determining whose turn it is and when the turn should be given to the listener – involve many factors including gaze and intonation [Duncan, 1974]. In addition, listeners can interrupt a speaker not only with voice, but also by gesturing to indicate that they want the turn.

Content elaboration and emphasis Gestures can convey information about the content of the conversation in ways for which the hands are uniquely suited. For example, the two hands can better indicate simultaneity and spatial relationships than the voice or other channels. Probably the most commonly thought of use of the body in conversation is the pointing (deictic) gesture, possibly accounting for the fact that it is also the most commonly implemented for the bodies of animated interface agents. In fact, however, most conversations don't involve many deictic gestures [McNeill, 1992] unless the interlocutors are discussing a shared task that is currently present. Other conversational gestures also convey semantic and pragmatic information. Beat gestures are small, rhythmic, baton-like movements of the hands that do not change in form with the content of the accompanying speech. They serve a pragmatic
function, conveying information about what is “new” in the speaker's discourse. Iconic and metaphoric gestures convey some features of the action or event being described. They can be redundant or complementary relative to the speech channel, and thus can convey additional information or provide robustness or emphasis with respect to what is being said. Whereas iconics convey information about spatial relationships or concepts, metaphorics represent concepts which have no physical form, such as a sweeping gesture accompanying “the property title is free and clear.”
Feedback and error correction During conversation, speakers can nonverbally request feedback from listeners through gaze and raised eyebrows and listeners can provide feedback through head nods and paraverbals (“uh-huh”, “mmm”, etc.) if the speaker is understood, or a confused facial expression or lack of positive feedback if not. The listener can also ask clarifying questions if they did not hear or understand something the speaker said.
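As an illustration of keeping communicative function separate from surface behaviour, the following small Python sketch (ours, not part of the REA implementation) maps function names like those in Table 2.1 to alternative behaviour realizations; the dictionary structure and the random choice among alternatives are assumptions.

```python
import random

# One communicative function can be realized by several behaviour sequences,
# and the same behaviour can serve several functions (cf. Table 2.1).
FUNCTION_TO_BEHAVIOURS = {
    "distance_salutation": [["look", "head_toss", "raise_eyebrows", "wave", "smile"]],
    "close_salutation":    [["look", "head_nod", "smile"], ["look", "handshake", "smile"]],
    "give_turn":           [["look", "raise_eyebrows", "pause"]],
    "wanting_turn":        [["raise_hands_into_gesture_space"]],
    "take_turn":           [["glance_away", "start_talking"]],
    "request_feedback":    [["look", "raise_eyebrows"]],
    "give_feedback":       [["look", "head_nod"]],
}


def realize(function_name: str) -> list:
    """Pick one concrete behaviour sequence for a communicative function."""
    alternatives = FUNCTION_TO_BEHAVIOURS[function_name]
    return random.choice(alternatives)


if __name__ == "__main__":
    print(realize("close_salutation"))  # e.g. ['look', 'head_nod', 'smile']
```

Because the mapping is many-to-many, a planner working at the level of functions can swap in whichever realization fits the current conversational phase, which is the flexibility argued for above.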
2.2 Interactional versus Propositional Behaviours
The mapping from form (behaviour) to conversational function relies on a fundamental division of conversational goals: contributions to a conversation can be propositional and interactional. Propositional information corresponds to the content of the conversation. This includes meaningful speech as well as hand gestures and intonation used to complement or elaborate upon the speech content (gestures that indicate the size in the sentence “it was this big” or rising intonation that indicates a question with the sentence “you went to the store”). Interactional information consists of the cues that regulate conversational process and includes a range of nonverbal behaviours (quick head nods to indicate that one is following) as well as regulatory speech (“huh?”, “Uhhuh”). This theoretical stance allows us to examine the role of embodiment not just in task- but also process-related behaviours such as social dialogue [Cassell et al., 2000b].
2.3 REA
Our platform for conducting research into embodied conversational agents is the REA system, developed in the Gesture and Narrative Language Group at the MIT Media Lab [Cassell et al., 2000a]. REA is an embodied, multi-modal real-time conversational interface agent which implements the conversational protocols described above in order to make interactions as natural as face-to-face conversation with another person. In the current task domain, REA acts as a real estate salesperson, answering user questions about properties in her database and showing users around the virtual houses (Figure 2.1).
Figure 2.1. User interacting with REA.
REA has a fully articulated graphical body, can sense the user passively through cameras and audio input, and is capable of speech with intonation, facial display, and gestural output. The system currently consists of a large projection screen on which REA is displayed and which the user stands in front of. Two cameras mounted on top of the projection screen track the user’s head and hand positions in space. Users wear a microphone for capturing speech input. A single SGI Octane computer runs the graphics and conversation engine of REA, while several other computers manage the speech recognition and generation and image processing. REA is able to conduct a conversation describing the features of the task domain while also responding to the users’ verbal and non-verbal input. When the user makes cues typically associated with turn taking behaviour such as gesturing, REA allows herself to be interrupted, and then takes the turn again when she is able. She is able to initiate conversational error correction when she misunderstands what the user says, and can generate combined voice, facial expression and gestural output. REA’s responses are generated by an incremental natural language generation engine based on [Stone and Doran, 1997] that has been extended to synthesize redundant and complementary gestures synchronized with speech output [Cassell et al., 2000b]. A simple discourse
model is used for determining which speech acts users are engaging in, and resolving and generating anaphoric references.
3. Social Dialogue
Social dialogue is talk in which interpersonal goals are foregrounded and task goals – if existent – are backgrounded. One of the most familiar contexts in which social dialogue occurs is in human social encounters between individuals who have never met or are unfamiliar with each other. In these situations conversation is usually initiated by “small talk” in which “light” conversation is made about neutral topics (e.g., weather, aspects of the interlocutor’s physical environment) or in which personal experiences, preferences, and opinions are shared [Laver, 1981]. Even in business or sales meetings, it is customary (at least in American culture) to begin with some amount of small talk before “getting down to business”.
3.1 The Functions of Social Dialogue
The purpose of small talk is primarily to build rapport and trust among the interlocutors, provide time for them to “size each other up”, establish an interactional style, and to allow them to establish their reputations [Dunbar, 1996]. Although small talk is most noticeable at the margins of conversational encounters, it can be used at various points in the interaction to continue to build rapport and trust [Cheepen, 1988], and in real estate sales, a good agent will continue to focus on building rapport throughout the relationship with a buyer [Garros, 1999].

Small talk has received sporadic treatment in the linguistics literature, starting with the seminal work of Malinowski, who defined “phatic communion” as “a type of speech in which ties of union are created by a mere exchange of words”. Small talk is the language used in free, aimless social intercourse, which occurs when people are relaxing or when they are accompanying “some manual work by gossip quite unconnected with what they are doing” [Malinowski, 1923]. Jakobson also included a “phatic function” in his well-known conduit model of communication, that function being focused on the regulation of the conduit itself (as opposed to the message, sender, or receiver) [Jakobson, 1960]. More recent work has further characterized small talk by describing the contexts in which it occurs, topics typically used, and even grammars which define its surface form in certain domains [Cheepen, 1988; Laver, 1975; Schneider, 1988]. In addition, degree of “phaticity” has been proposed as a persistent goal which governs the degree of politeness in all utterances a speaker makes, including task-oriented ones [Coupland et al., 1992].
3.2 The Relationship between Social Dialogue and Trust
Figure 2.2 outlines the relationship between small talk and trust. REA's dialogue planner represents the relationship between her and the user using a multi-dimensional model of interpersonal relationship based on [Svennevig, 1999]:

Familiarity describes the way in which relationships develop through the reciprocal exchange of information, beginning with relatively nonintimate topics and gradually progressing to more personal and private topics. The growth of a relationship can be represented in both the breadth (number of topics) and depth (public to private) of information disclosed [Altman and Taylor, 1973].

Solidarity is defined as “like-mindedness” or having similar behaviour dispositions (e.g., similar political membership, family, religions, profession, gender, etc.), and is very similar to the notion of social distance used by Brown and Levinson in their theory of politeness [Brown and Levinson, 1978]. There is a correlation between frequency of contact and solidarity, but it is not necessarily a causal relation [Brown and Levinson, 1978; Brown and Gilman, 1972].

Affect represents the degree of liking the interactants have for each other, and there is evidence that this is a relational attribute independent of the above [Brown and Gilman, 1989].

The mechanisms by which small talk is hypothesized to effect trust include facework, coordination, building common ground, and reciprocal appreciation.
Facework The notion of “face” is “the positive social value a person effectively claims for himself by the social role others assume he has taken during a particular contact” [Goffman, 1967]. Interactants maintain face by having their social role accepted and acknowledged. Events which are incompatible with their line are “face threats” and are mitigated by various corrective measures if they are not to lose face. Small talk avoids face threat (and therefore maintains solidarity) by keeping conversation at a safe level of depth.

Coordination The process of interacting with a user in a fluid and natural manner may increase the user's liking of the agent, and the user's positive affect, since the simple act of coordination with another appears to be deeply gratifying. “Friends are a major source of joy, partly because of the enjoyable things they do together, and the reason that they are enjoyable is perhaps the coordination.” [Argyle, 1990]. Small talk increases coordination between the two participants by allowing them to synchronize short units of talk and nonverbal acknowledgement (and therefore leads to increased liking and positive affect).
Figure 2.2. How small talk effects trust [Cassell and Bickmore, 2003].
Building common ground Information which is known by all interactants to be shared (mutual knowledge) is said to be in the “common ground” [Clark, 1996]. The principal way for information to move into the common ground is via face-to-face communication, since all interactants can observe the recognition and acknowledgment that the information is in fact mutually shared. One strategy for effecting changes to the familiarity dimension of the relationship model is for speakers to disclose personal information about themselves – moving it into the common ground – and induce the listener to do the same. Another strategy is to talk about topics that are obviously in the common ground – such as the weather, physical surroundings, and other topics available in the immediate context of utterance. Small talk establishes common ground (and therefore increases familiarity) by discussing topics that are clearly in the context of utterance.

Reciprocal appreciation In small talk, demonstrating appreciation for and agreement with the contributions of one's interlocutor is obligatory. Performing this aspect of the small talk ritual increases solidarity by showing mutual agreement on the topics discussed.
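A minimal sketch (ours; the field names, value ranges and the averaging used for closeness are assumptions, since the chapter only states that closeness combines familiarity and solidarity) of how the relationship dimensions described above might be represented:

```python
from dataclasses import dataclass


@dataclass
class RelationshipModel:
    """Interpersonal relationship between agent and user (after Svennevig, 1999)."""
    familiarity: float  # grows as increasingly personal topics are reciprocally disclosed
    solidarity: float   # "like-mindedness" / small social distance
    affect: float       # degree of liking between the interactants

    def closeness(self) -> float:
        """Composite 'closeness' used to decide whether a face-threatening topic is safe yet.
        The simple average is an assumption; only the composition itself is stated in the text."""
        return (self.familiarity + self.solidarity) / 2.0
```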
3.3 Nonverbal Behaviour in Social Dialogue
According to Argyle, nonverbal behaviour is used to express emotions, to communicate interpersonal attitudes, to accompany and support speech, for self presentation, and to engage in rituals such as greetings [Argyle, 1988]. Of these, coverbal and emotional display behaviours have received the most attention in the literature on embodied conversational agents and facial and character animation in general, e.g. [Cassell et al., 2000c]. Next to these, the most important use of nonverbal behaviour in social dialogue is the display of interpersonal attitude [Argyle, 1988]. The display of positive or negative attitude can greatly influence whether we approach someone or not and our initial perceptions of them if we do. The most consistent finding in this area is that the use of nonverbal “immediacy behaviours” – close conversational distance, direct body and facial orientation, forward lean, increased and direct gaze, smiling, pleasant facial expressions and facial animation in general, nodding, frequent gesturing and postural openness – projects liking for the other and engagement in the interaction, and is correlated with increased solidarity [Argyle, 1988; Richmond and McCroskey, 1995]. Other nonverbal aspects of “warmth” include kinesic behaviours such as head tilts, bodily relaxation, lack of random movement, open body positions, and postural mirroring, and vocalic behaviours such as more variation in pitch, amplitude, duration and tempo, reinforcing interjections such as “uh-huh” and “mm-hmmm”, greater fluency, warmth, pleasantness, expressiveness, and clarity, and smoother turn-taking [Andersen and Guerrero, 1998].

In summary, nonverbal behaviour plays an important role in all face-to-face interaction – both conveying redundant and complementary propositional information (with respect to speech) and regulating the structure of the interaction. In social dialogue, however, it provides the additional, and crucial, function of conveying attitudinal information about the nature of the relationship between the interactants.
4. Related Work
4.1 Related Work on Embodied Conversational Agents
Work on the development of ECAs, as a distinct field of development, is best summarized in [Cassell et al., 2000c]. The current study is based on the REA ECA (see Figure 2.1), a simulated real estate agent, who uses vision-based gesture recognition, speech recognition, discourse planning, sentence and gesture planning, speech synthesis and animation of a 3D body [Cassell et al., 1999]. Some of the other major systems developed to date are Steve [Rickel and Johnson, 1998], the DFKI Persona [André et al., 1996], Olga [Beskow and
McGlashan, 1997], and pedagogical agents developed by Lester et al. [1999]. Sidner and Dzikovska [2005] report progress on a robotic ECA that performs hosting activities, with a special emphasis on “engagement” – an interactional behaviour whose purpose is to establish and maintain the connection between interlocutors during a conversation. These systems vary in their linguistic generativity, input modalities, and task domains, but all aim to engage the user in natural, embodied conversation. Little work has been done on modelling social dialogue with ECAs. The August system is an ECA kiosk designed to give information about local restaurants and other facilities. In an experiment to characterize the kinds of things that people would say to such an agent, over 10,000 utterances from over 2,500 users were collected. It was found that most people tried to socialize with the agent, with approximately 1/3 of all recorded utterances classified as social in nature [Gustafson et al., 1999].
4.2 Related Studies on Embodied Conversational Agents
Koda and Maes [1996] and Takeuchi and Naito [1995] studied interfaces with static or animated faces, and found that users rated them to be more engaging and entertaining than functionally equivalent interfaces without a face. Kiesler and Sproull [1997] found that users were more likely to be cooperative with an interface agent when it had a human face (vs. a dog or cartoon dog). André, Rist and Muller found that users rated their animated presentation agent (“PPP Persona”) as more entertaining and helpful than an equivalent interface without the agent [André et al., 1998]. However, there was no difference in actual performance (comprehension and recall of presented material) in interfaces with the agent vs. interfaces without it. In another study involving this agent, van Mulken, André and Muller found that when the quality of advice provided by an agent was high, subjects actually reported trusting a text-based agent more than either their ECA or a video-based agent (when the quality of advice was low there were no significant differences in trust ratings between agents) [van Mulken et al., 1999]. In a user study of the Gandalf system [Cassell et al., 1999], users rated the smoothness of the interaction and the agent’s language skills significantly higher under test conditions in which Gandalf utilized limited conversational behaviour (gaze, turn-taking and beat gesture) than when these behaviours were disabled. In terms of social behaviours, Sproull et al. [1997] showed that subjects rated a female embodied interface significantly lower in sociability and gave it a significantly more negative social evaluation compared to a text-only interface. Subjects also reported being less relaxed and assured when interacting with the embodied interface than when interacting with the text interface.
Finally, they gave themselves significantly higher scores on social desirability scales, but disclosed less (wrote significantly less and skipped more questions in response to queries by the interface) when interacting with an embodied interface vs. a text-only interface. Men were found to disclose more in the embodied condition and women disclosed more in the text-only condition. Most of these evaluations have tried to address whether embodiment of a system is useful at all, by including or not including an animated figure. In their survey of user studies on embodied agents, Dehn and van Mulken conclude that there is no “persona effect”, that is a general advantage of an interface with an animated agent over one without an animated agent [Dehn and van Mulken, 2000]. However, they believe that lack of evidence and inconsistencies in the studies performed to date may be attributable to methodological shortcomings and variations in the kinds of animations used, the kinds of comparisons made (control conditions), the specific measures used for the dependent variables, and the task and context of the interaction.
4.3 Related Studies on Mediated Communication
Several studies have shown that people speak differently to a computer than to another person, even though there are typically no differences in task outcomes in these evaluations. Hauptmann and Rudnicky [1988] performed one of the first studies in this area. They asked subjects to carry out a simple information-gathering task through a (simulated) natural language speech interface, and compared this with speech to a co-present human in the same task. They found that speech to the simulated computer system was telegraphic and formal, approximating a command language. In particular, when speaking to what they believed to be a computer, subjects' utterances used a small vocabulary, often sounding like system commands, with very few task-unrelated utterances, and fewer filled pauses and other disfluencies. These results were extended in research conducted by Oviatt [Oviatt, 1995; Oviatt and Adams, 2000; Oviatt, 1998], in which she found that speech to a computer system was characterized by a low rate of disfluencies relative to speech to a co-present human. She also noted that visual feedback has an effect on disfluency: telephone calls have a higher rate of disfluency than co-present dialogue. From these results, it seems that people speak more carefully and less naturally when interacting with a computer.

Boyle et al. [1994] compared pairs of subjects working on a map-based task who were visible to each other with pairs of subjects who were co-present but could not see each other. Although no performance difference was found between the two conditions, when subjects could not see one another, they compensated by giving more verbal feedback and using longer utterances. Their conversation was found to be less smooth than that between mutually visible
partners, indicated by more interruptions, and less efficient, as more turns were required to complete the task. The researchers concluded that visual feedback improves the smoothness and efficiency of the interaction, but that we have devices to compensate for this when visibility is restricted. Daly-Jones et al. [1998] also failed to find any difference in performance between video-mediated and audio-mediated conversations, although they did find differences in the quality of the interactions (e.g., more explicit questions in the audio-only condition). Whittaker and O'Conaill [1997] survey the results of several studies which compared video-mediated communication with audio-only communication and concluded that the visual channel does not significantly impact performance outcomes in task-oriented collaborations, although it does affect social and affective dimensions of communication. Comparing video-mediated communication to face-to-face and audio-only conversations, they also found that speakers used more formal turn-taking techniques in the video condition even though users reported that they perceived many benefits to video conferencing relative to the audio-only mode.

In a series of studies on the effects of different media and activities on trust, Zheng, Veinott et al. have demonstrated that social interaction, even if carried out online, significantly increases people's trust in each other [Zheng et al., 2002]. Similarly, Bos et al. [2002] demonstrated that richer media – such as face-to-face, video-, and audio-mediated communication – lead to higher trust levels than media with lower bandwidth such as text chat.

Finally, a number of studies have been done comparing face-to-face conversations with conversations on the phone [Rutter, 1987]. These studies find that, in general, there is more cooperation and trust in face-to-face interaction. One study found that audio-only communication encouraged negotiators to behave impersonally, to ignore the subtleties of self-presentation, and to concentrate primarily on pursuing victory for their side. Other studies found similar gains in cooperation among subjects playing prisoner's dilemma face-to-face compared to playing it over the phone. Face-to-face interactions are also less formal and more spontaneous than conversations on the phone. One study found that face-to-face discussions were more protracted and wide-ranging while subjects communicating via audio-only kept much more to the specific issues on the agenda (the study also found that when the topics were more wide-ranging, changes in attitude among the participants were more likely to occur). Although several studies found increases in favourable impressions of interactants in face-to-face conversation relative to audio-only, these effects have not been consistently validated.
4.4 Trait-Based Variation in User Responses
Several studies have shown that users react differently to social agents based on their own personality and other dispositional traits. For example, Reeves and Nass have shown that users like agents that match their own personality (on the introversion/extraversion dimension) more than those which do not, regardless of whether the personality is portrayed through text or speech [Nass and Gong, 2000; Reeves and Nass, 1996]. Resnick and Lammers showed that in order to change user behaviour via corrective error messages, the messages should have different degrees of “humanness” depending on whether the user has high or low self-esteem (“computer-ese” messages should be used with low self-esteem users, while “human-like” messages should be used with high-esteem users) [Resnick and Lammers, 1985]. Rickenberg and Reeves showed that different types of animated agents affected the anxiety level of users differentially as a function of whether users tended towards internal or external locus of control [Rickenberg and Reeves, 2000]. In our earlier study on the effects of social dialogue on trust in ECA interactions, we found that social dialogue significantly increased trust for extraverts, while it made no significant difference for introverts [Cassell and Bickmore, 2003]. In light of the studies summarized here, the question that remains is whether these effects continue to hold if the nonverbal cues provided by the ECA are removed.
5. Social Dialogue in REA
For the purpose of trust elicitation and small talk, we have constructed a new kind of discourse planner that can interleave small talk and task talk during the initial buyer interview, based on the relational model outlined above. An overview of the planner is provided here; details of its implementation can be found in Cassell and Bickmore [2003].
5.1 Planning Model
Given that many of the goals in a relational conversational strategy are nondiscrete (e.g., minimize face threat), and that trade-offs among multiple goals have to be achieved at any given time, we have moved away from static world discourse planning, and are using an activation network-based approach based on Maes’ Do the Right Thing architecture [Maes, 1989]. This architecture provides the capability to transition smoothly from deliberative, planned behaviour to opportunistic, reactive behaviour, and is able to pursue multiple, non-discrete goals. In our implementation each node in the network represents a conversational move that REA can make.
Thus, during task talk, REA may ask questions about users' buying preferences, such as the number of bedrooms they need. During small talk, REA can talk about the weather, events and objects in her shared physical context with the user (e.g., the lab setting), or she can tell stories about the lab, herself, or real estate. REA's conversational moves are planned in order to minimize the face threat to the user, and maximize trust, while pursuing her task goals in the most efficient manner possible. That is, REA attempts to determine the face threat of her next conversational move, assesses the solidarity and familiarity which she currently holds with the user, and judges which topics will seem most relevant and least intrusive to users. As a function of these factors, REA chooses whether or not to engage in small talk, and what kind of small talk to choose. The selection of which move should be pursued by REA at any given time is thus a non-discrete function of the following factors:

Closeness REA continually assesses her “interpersonal” closeness with the user, which is a composite representing depth of familiarity and solidarity, modelled as a scalar quantity. Each conversational topic has a predefined, pre-requisite closeness that must be achieved before REA can introduce the topic. Given this, the system can plan to perform small talk in order to “grease the tracks” for task talk, especially about sensitive topics like finance.

Topic REA keeps track of the current and past conversational topics. Conversational moves which stay within topic (maintain topic coherence) are given preference over those which do not. In addition, REA can plan to execute a sequence of moves which gradually transition the topic from its current state to one that REA wants to talk about (e.g., from talk about the weather, to talk about Boston weather, to talk about Boston real estate).

Relevance REA maintains a list of topics that she thinks the user knows about, and the discourse planner prefers moves which involve topics in this list. The list is initialized to things that anyone talking to REA would know about – such as the weather outside, Cambridge, MIT, or the laboratory that REA lives in.

Task goals REA has a list of prioritized goals to find out about the user's housing needs in the initial interview. Conversational moves which directly work towards satisfying these goals (such as asking interview questions) are preferred.

Logical preconditions Conversational moves have logical preconditions (e.g., it makes no sense for REA to ask users what their major is
until she has established that they are students), and are not selected for execution until all of their preconditions are satisfied.

One advantage of the activation network approach is that by simply adjusting a few gains we can make REA more or less coherent, more or less polite (attentive to closeness constraints), more or less task-oriented, or more or less deliberative (vs. reactive) in her linguistic behaviour. In the current implementation, the dialogue is entirely REA-initiated, and user responses are recognized via a speaker-independent, grammar-based, continuous speech recognizer (currently IBM ViaVoice). The active grammar fragment is specified by the current conversational move, and for responses to many REA small talk moves the content of the user's speech is ignored; only the fact that the person responded at all is enough to advance the dialogue.

At each step in the conversation in which REA has the floor (as tracked by a conversational state machine in REA's Reaction Module [Cassell et al., 2000a]), the discourse planner is consulted for the next conversational move to initiate. At this point, activation values are incrementally propagated through the network (following [Maes, 1989]) until a move is selected whose preconditions are satisfied and whose activation value is over a specified threshold. Within this framework, REA decides to do small talk whenever closeness with the user needs to be increased (e.g., before a task query can be asked), or the topic needs to be moved little-by-little to a desired topic and small talk contributions exist which can facilitate this. The activation energy from the user relevance condition described above leads to REA starting small talk with topics that are known to be in the shared environment with the user (e.g., talk about the weather or the lab).
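The following Python sketch (ours, not the actual REA code) illustrates the style of move selection just described: each candidate move accumulates activation from the closeness, topic-coherence, relevance and task-goal factors, and the highest-scoring move whose logical preconditions hold and whose activation clears a threshold is selected. The names, weights and simple additive scoring are assumptions; the real planner propagates activation through a Maes-style network, as described in Cassell and Bickmore [2003].

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Move:
    """One conversational move the agent can make (small talk or task talk)."""
    utterance: str
    topic: str
    required_closeness: float              # pre-requisite closeness for this topic
    satisfies_task_goal: bool = False
    precondition: Callable[[dict], bool] = lambda state: True


def activation(move: Move, state: dict, weights: dict) -> float:
    """Non-discrete score combining the factors described in the text."""
    score = 0.0
    if state["closeness"] >= move.required_closeness:
        score += weights["closeness"]
    if move.topic == state["current_topic"]:
        score += weights["topic_coherence"]
    if move.topic in state["relevant_topics"]:
        score += weights["relevance"]
    if move.satisfies_task_goal:
        score += weights["task_goals"]
    return score


def select_move(moves: list, state: dict, weights: dict, threshold: float) -> Optional[Move]:
    """Return the highest-activation move whose preconditions hold and which clears the threshold."""
    candidates = [m for m in moves if m.precondition(state)]
    candidates.sort(key=lambda m: activation(m, state, weights), reverse=True)
    if candidates and activation(candidates[0], state, weights) >= threshold:
        return candidates[0]
    return None
```

Adjusting the entries of weights corresponds to the gain adjustments mentioned above, e.g. raising the topic-coherence gain relative to the task-goal gain makes the agent more coherent and less narrowly task-oriented.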
5.2 Interactional Behaviour during Social Dialogue
Shifts between small talk moves and task moves are marked by conventional contextualization cues – discourse markers and beat gestures. Discourse markers include “so” on the first small talk to task talk transition, “anyway” on resumption of task talk from small talk, and “you know” on transition to small talk from task talk [Clark, 1996]. Prior to producing lengthy utterances, REA gazes away briefly before she starts her turn, partly as a turn-taking and floor-holding move and partly to mask the processing delays in generating long utterances. Finally, REA smiles as soon as she detects that users have started their speaking turns (using audio thresholding of the user’s speech).
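A toy sketch (ours) of the discourse-marker selection just described; the function and state names are assumptions:

```python
from typing import Optional


def discourse_marker(prev_move: str, next_move: str, first_transition_done: bool) -> Optional[str]:
    """Pick a contextualization cue for a shift between small talk and task talk."""
    if prev_move == "small_talk" and next_move == "task_talk":
        # "so" on the first shift into task talk, "anyway" on later resumptions of task talk
        return "so" if not first_transition_done else "anyway"
    if prev_move == "task_talk" and next_move == "small_talk":
        return "you know"
    return None  # no marker when staying within the same register
```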
5.3 Example Interaction
An interview between REA and a user typically proceeds as shown in the following dialogue. (User responses are only shown in positions in which they
affect the selection of subsequent moves; REA contributions are her actual utterances.)

1 That microphone is terrible, I hate using those things.
2 Sorry about my voice, this is some engineer's idea of natural sounding.
3 Are you one of our sponsors?
User: Yes
4 Were you at our last sponsor meetings?
5 I got so exhausted at the last sponsor meeting I think I was starting to lose my voice by the end.
6 So, where would you like to live?
7 How many bedrooms do you need?
8 Do you need access to the subway?
9 Is one bath enough?
10 You know, Boston is certainly more expensive than it used to be.
11 Anyway, what can you afford?
12 What kind of down payment can you make?
13 Let me see what I have available.

Dialogue 1. "Small Talk" REA.
In this example, REA opens with small talk moves regarding things in her shared physical environment with the user (1-2). She then proceeds to small talk related to sponsors (after establishing that the user is a sponsor). After a few turns, enough closeness has been established (simply by doing small talk) that REA can move into task talk (6-9). However, before bringing up the topic of finance – a topic that is potentially very face threatening for the user – REA decides that additional closeness needs to be established, and moves back into small talk (10). This small talk move not only increases closeness but shifts the topic to finance, enabling REA to then bring up the issue of how much the user is able to afford (11-12). If REA’s adherence to closeness preconditions is reduced, by decreasing the contributions of these preconditions to the activation of joint projects, this results in her engaging in less small talk and being more task goal oriented. If everything else is held constant (relative to the prior example) the following dialogue is produced.
1 So, where would you like to live?
2 What can you afford?
3 What kind of down payment can you make?
4 How many bedrooms do you need?
5 Do you need access to the subway?
6 Is one bath enough?
7 Let me see what I have available.

Dialogue 2. "Task-only" REA.
In this example, REA does not perform any small talk and sequences the task questions in strictly decreasing order of priority.
6. A Study Comparing ECA Social Dialogue with Audio-Only Social Dialogue
The dialogue model presented above produces a reasonable facsimile of the social dialogue observed in service encounters such as real estate sales. But, does small talk produced by an ECA in a sales encounter actually build trust and solidarity with users? And, does nonverbal behaviour play the same critical role in human-ECA social dialogue as it appears to play in human-human social interactions? In order to answer these questions, we conducted an empirical study in which subjects were interviewed by REA about their housing needs, shown two “virtual” apartments, and then asked to submit a bid on one of them. For the purpose of the experiment, REA was controlled by a human wizard and followed scripts identical to the output of the planner (but faster, and not dependent on automatic speech recognition or computational vision). Users interacted with one of two versions of REA which were identical except that one had only task-oriented dialogue (TASK condition) while the other also included the social dialogue designed to avoid face threat, and increase trust (SOCIAL condition). A second manipulation involved varying whether subjects interacted with the fully embodied REA – appearing in front of the virtual apartments as a life-sized character (EMBODIED condition) – or viewed only the virtual apartments while talking with REA over a telephone. Together these variables provided a 2x2 experimental design: SOCIAL vs. TASK and EMBODIED vs. PHONE. Our hypotheses follow from the literature on small talk and on trust among humans. We expected subjects in the SOCIAL condition to trust REA more, feel closer to REA, like her more, and feel that they understood each other more
than in the TASK condition. We also expected users to think the interaction was more natural, lifelike, and comfortable in the SOCIAL condition. Finally, we expected users to be willing to pay REA more for an apartment in the SOCIAL condition, given the hypothesized increase in trust. We also expected all of these SOCIAL effects to be amplified in the EMBODIED condition relative to the PHONE-only condition.
6.1 Experimental Methods
This was a multivariate, multiple-factor, between-subjects experimental design, involving 58 subjects (69% male and 31% female).
6.1.1 Apparatus. One wall of the experiment room was a rear-projection screen. In the EMBODIED condition REA appeared life-sized on the screen, in front of the 3D virtual apartments she showed, and her synthetic voice was played through two speakers on the floor in front of the screen. In the PHONE condition only the 3D virtual apartments were displayed and subjects interacted with REA over an ordinary telephone placed on a table in front of the screen. For the purpose of this experiment, REA was controlled via a wizard-of-oz setup on another computer positioned behind the projection screen. The interaction script included verbal and nonverbal behaviour specifications for REA (e.g., gesture and gaze commands as well as speech), and embedded commands describing when different rooms in the virtual apartments should be shown. Three pieces of information obtained from the user during the interview were entered into the control system by the wizard: the city the subject wanted to live in; the number of bedrooms s/he wanted; and how much s/he was willing to spend. The first apartment shown was in the specified city, but had twice as many bedrooms as the subject requested and cost twice as much as s/he could afford (they were also told the price was “firm”). The second apartment shown was in the specified city, had the exact number of bedrooms requested, but cost 50% more than the subject could afford (but this time, the subject was told that the price was “negotiable”). The scripts were comprised of a linear sequence of utterances (statements and questions) that would be made by REA in a given interaction: there was no branching or variability in content beyond the three pieces of information described above. This helped ensure that all subjects received the same intervention regardless of what they said in response to any given question by REA. Subject-initiated utterances were responded to with either backchannel feedback (e.g., “Really?”) for statements or “I don’t know” for questions, followed by an immediate return to the script. The scripts for the TASK and SOCIAL conditions were identical, except that the SOCIAL script had additional small talk utterances added to it, as
described in [Bickmore and Cassell, 2001]. The part of the script governing the dialogue from the showing of the second apartment through the end of the interaction was identical in both conditions. Procedure. Subjects were told that they would be interacting with REA, who played the role of a real estate agent and could show them apartments she had for rent. They were told that they were to play the role of someone looking for an apartment in the Boston area. In both conditions subjects were told that they could talk to REA “just like you would to another person”.
6.1.2 Measures. Subjective evaluations of REA – including how friendly, credible, lifelike, warm, competent, reliable, efficient, informed, knowledgeable and intelligent she was – were measured by single items on nine-point Likert scales. Evaluations of the interaction – including how tedious, involving, enjoyable, natural, satisfying, fun, engaging, comfortable and successful it was – were also measured on nine-point Likert scales. Evaluation of how well subjects felt they knew REA, how well she knew and understood them and how close they felt to her were measured in the same manner. All scales were adapted from previous research on user responses to personality types in embodied conversational agents [Moon and Nass, 1996].

Liking of REA was an index composed of three items – how likeable and pleasant REA was and how much subjects liked her – measured on nine-point Likert scales (Cronbach's alpha =.87).

Amount Willing to Pay was computed as follows. During the interview, REA asked subjects how much they were able to pay for an apartment; subjects' responses were entered as $X per month. REA then offered the second apartment for $Y (where Y = 1.5 X), and mentioned that the price was negotiable. On the questionnaire, subjects were asked how much they would be willing to pay for the second apartment, and this was encoded as Z. The task measure used was (Z - X) / (Y - X), which varies from 0% if the user did not budge from their original requested price, to 100% if they offered the full asking price.

Trust was measured by a standardized trust scale [Wheeless and Grotz, 1977] (alpha =.93). Although trust is sometimes measured behaviourally using a Prisoner's Dilemma game [Zheng et al., 2002], we felt that our experimental protocol was already too long and that game-playing did not fit well into the real estate scenario.

Given literature on the relationship between user personality and preference for computer behaviour, we were concerned that subjects might respond differentially based on predisposition. Thus, we also included composite measures for introversion and extroversion on the questionnaire.

Extrovertedness was an index composed of seven Wiggins [Wiggins, 1979] extrovert adjective items: Cheerful, Enthusiastic, Extroverted, Jovial,
Outgoing, and Perky. It was used for assessment of the subject's personality (alpha =.87).

Introvertedness was an index composed of seven Wiggins [Wiggins, 1979] introvert adjective items: Bashful, Introverted, Inward, Shy, Undemonstrative, Unrevealing, and Unsparkling. It was used for assessment of the subject's personality (alpha =.84).

Note that these personality scales were administered on the post-test questionnaire. For the purposes of this experiment, therefore, subjects who scored over the mean on introversion-extroversion were said to be extroverts, while those who scored under the mean were said to be introverts.
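For concreteness, the following sketch (ours) works through the Amount Willing to Pay measure and the mean-split personality classification described above; variable names follow the text.

```python
def amount_willing_to_pay(x_stated_budget: float, z_final_offer: float) -> float:
    """Task measure (Z - X) / (Y - X): 0.0 if the subject stayed at their stated budget X,
    1.0 if they offered REA's full asking price Y = 1.5 * X."""
    y_asking_price = 1.5 * x_stated_budget
    return (z_final_offer - x_stated_budget) / (y_asking_price - x_stated_budget)


# Example: a subject who said they could afford $1000/month and finally offered $1250
# moved half-way towards the $1500 asking price, i.e. a score of 0.5 (50%).
assert amount_willing_to_pay(1000, 1250) == 0.5


def classify_by_mean_split(extroversion_scores: list) -> list:
    """Mean split used in the study: scoring over the sample mean counts as extrovert."""
    mean = sum(extroversion_scores) / len(extroversion_scores)
    return ["extrovert" if s > mean else "introvert" for s in extroversion_scores]
```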
6.1.3 Behavioural measures. Rates of speech disfluency (as defined in [Oviatt, 1995]) and utterance length were coded from the video data. Observation of the videotaped data made it clear that some subjects took the initiative in the conversation, while others allowed REA to lead. Unfortunately, REA is not yet able to deal with user-initiated talk, and so user initiative often led to REA interrupting the speaker. To assess the effect of this phenomenon, we therefore divided subjects into PASSIVE (below the mean on number of user-initiated utterances) and ACTIVE (above the mean on number of userinitiated utterances). To our surprise, these measures turned out to be independent of introversion/extroversion (Pearson r=0.042), and to not be predicted by these latter variables.
6.2 Results
Full factorial single measure ANOVAs were run, with SOCIALITY (Task vs. Social), PERSONALITY OF SUBJECT (Introvert vs. Extrovert), MEDIUM (Phone vs. Embodied) and INITIATION (Active vs. Passive) as independent variables.
6.2.1 Subjective assessments of REA. In looking at the questionnaire data, our first impression is that subjects seemed to feel more comfortable interacting with REA over the phone than face-to-face. Thus, subjects in the phone condition felt that they knew REA better (F=5.02; p<.05), liked her more (F=4.70; p<.05), felt closer to her (F=13.37; p<.001), felt more comfortable with the interaction (F=3.59; p<.07), and thought REA was more friendly (F=8.65;p <.005), warm (F=6.72; p<.05), informed (F=5.73; p<.05), and knowledgeable (F=3.86; p<.06) than those in the embodied condition. However, in the remainder of the results section, as we look more closely at different users, different kinds of dialogue styles, and users’ actual behaviour, a more complicated picture emerges. Subjects felt that REA knew them (F=3.95; p<.06) and understood them (F=7.13; p<.05) better when she used task-only dialogue face-to-face; these trends were reversed for phone-based interactions. Task-only dialogue was more fun (F=3.36; p<.08) and less tedious (F=8.77;
Figure 2.3. Ratings of TEDIOUS.
Figure 2.4. Amount subjects were willing to pay.
That is, subjects preferred to interact, and felt better understood, face-to-face when it was a question of simply "getting down to business," and preferred to interact, and felt better understood, by phone when the dialogue included social chit-chat. These results may be telling us that REA's nonverbal behaviour inadvertently projected an unfriendly, introverted personality that was especially inappropriate for social dialogue. REA's model of non-verbal behaviour, at the time of this experiment, was limited to those behaviours linked to the discourse context. Thus, REA's smiles were limited to those related to the ends of turns, and she did not have a model of immediacy or other nonverbal cues for liking and warmth typical of social interaction [Argyle, 1988]. According to Whittaker and O'Conaill [1993], non-verbal information is especially crucial in interactions involving affective cues, such as negotiation or relational dialogue, and less important in purely problem-solving tasks. This interpretation of the results is backed up by comments such as this response from a subject in the face-to-face social condition:

   The only problem was how she would respond. She would pause then just say "OK", or "Yes". Also when she looked to the side and then back before saying something was a little bit unnatural.
This may explain why subjects preferred task interactions face-to-face, while on the phone REA’s social dialogue had its intended effect of making subjects feel that they knew REA better, that she understood them better, and that the experience was more fun and less tedious. In our earlier study, looking only at an embodied interface, we reported that extroverts trusted the system more when it engaged in small talk, while introverts were not affected by the use of small talk [Bickmore and Cassell, 2001]. In the current study, these results were re-confirmed, but only in the embodied
interaction; that is, a three-way interaction between SOCIALITY, PERSONALITY and MEDIUM (F=3.96; p<.06) indicated that extroverts trusted REA more when she used social dialogue in embodied interactions, but there was essentially no effect of user's personality and social dialogue on trust in phone interactions. Further analysis of the data indicates that this result derives from the substantial difference between introverts and extroverts in the face-to-face task-only condition. Introverts trusted REA significantly more in the face-to-face task-only condition than in the other conditions (p<.03), while extroverts trusted her significantly less in this condition than in the other conditions (p<.01). In light of these new observations, our earlier results indicating that social dialogue leads to increased trust (for extroverts at least) need to be revised. This further analysis indicates that the effects we observed may be due to the attraction of a computer displaying similar personality characteristics, rather than the process of trust-building. That is, in the face-to-face, task-only condition, both verbal and nonverbal channels appear to have inadvertently indicated that REA was an introvert (also supported by the comments that REA's gaze-away behaviour was too frequent, an indication of introversion [Wilson, 1977]), and in this condition we find the introverts trusting more, and extroverts trusting less. In all other conditions, the personality cues are either conflicting (a mismatch between verbal and nonverbal behaviour has been demonstrated to be disconcerting to users [Nass and Gong, 2000]) or only one channel of cues is available (i.e. on the phone), yielding trust ratings that are close to the overall mean. There was, nevertheless, a preference by extroverts for social dialogue as demonstrated by the fact that, overall, extroverts liked REA more when she used social dialogue, while introverts liked her more when she only talked about the task (F=8.09; p<.01). Passive subjects felt more comfortable interacting with REA than active subjects did, regardless of whether the interaction was face-to-face or on the phone, or whether REA used social dialogue or not. Passive subjects said that they enjoyed the interaction more (F=4.47; p<.05), felt it was more successful (F=6.04; p<.05) and liked REA more (F=3.24; p<.08), and that REA was more intelligent (F=3.40; p<.08), and knew them better (F=3.42; p<.08) than active subjects. These differences may be explained by the fixed-initiative dialogue model used in the WOZ script. REA's interaction was designed for passive users – there was very little capability in the interaction script to respond to unanticipated user questions or statements – and user initiation attempts were typically met with uncooperative system responses or interruptions. But, given the choice between phone and face-to-face, passive users preferred to interact with REA face-to-face: they rated her as more friendly (F=3.56; p<.07) and informed (F=6.30; p<.05) in this condition. Passive users also found the phone
to be more tedious, while active users found the phone to be less tedious (F=5.15; p<.05). Active users may have found the face-to-face condition particularly frustrating since processing delays may have led to the perception that the floor was open (inviting an initiation attempt), when in fact the wizard had already instructed REA to produce her next utterance. However, when interacting on the phone, active users differed from passive users in that active users felt she was more reliable when using social dialogue and passive users felt she was more reliable when using task-only dialogue. When interacting face-to-face with REA, there was no such distinction between active and passive users (F=4.67; p<.05).
6.2.2 Effects on task measure. One of the most tantalizing results obtained is that extroverts were willing to pay more for the same apartment in the embodied condition, while introverts were willing to pay more over the phone (F=3.41; p<.08), as shown in Figure 2.4. While potentially very significant, this finding is a little difficult to explain, especially given that trust did not seem to play a role in the evaluation. Perhaps, since we asked our subjects to play the role of someone looking for an apartment, and given that the apartments displayed were cartoon renditions, the subjects may not have felt personally invested in the outcome, and thus may have been more likely to be persuaded by associative factors like the perceived liking and credibility of REA. In fact, trust has been shown to not play a role in persuasion when “peripheral route” decisions are made, which is the case when the outcome is not of personal significance [Petty and Wegener, 1998]. Further, extroverts are not only more sociable, but more impulsive than introverts [Wilson, 1977], and impulse buying is governed primarily by novelty [Onkvisit and Shaw, 1994]. Extroverts did rate face-to-face interaction as more engaging than phone-based interaction (though not at a level of statistical significance), while introverts rated phone-based interactions as more engaging, providing some support for this explanation. It is also possible that this measure tells us more about subjects’ assessment of the house than of the realtor. In future experiments we may ask more directly whether the subject perceived the realtor to be asking a fair price. Perception of fairness of a price may be more linked to trust than is actual price demanded for a property. 6.2.3 Gender effects. Women felt that REA was more efficient (F=5.61; p<.05) and reliable (F=4.99; p<.05) in the embodied condition than when interacting with her over the phone, while men felt that she was more efficient and reliable by phone. Of course, REA has a female body and a female voice and so in order to have a clearer picture of the meaning of these results, a similar study would need to be carried out with a male realtor.
Table 2.2. Speech disfluencies per 100 words.

                Embodied   Phone   Overall
Disfluencies      4.83      6.73     5.83
Table 2.3. Speech disfluencies per 100 words for different types of human-human and simulated human-computer interactions (adapted from Oviatt [Oviatt, 1995]).

Human-human speech
  Two-person telephone call             8.83
  Two-person face-to-face dialogue      5.5

Human-computer speech
  Unconstrained computer interaction    1.80
  Structured computer interaction       0.83
6.2.4 Effects on behavioural measures. Although subjects' beliefs about REA and about the interaction are important, it is at least equally important to look at how subjects act, independent of their conscious beliefs. In this context we examined subjects' disfluencies when speaking with REA. Remember that disfluency can be a measure of naturalness – human-human conversation demonstrates more disfluency than does human-computer communication [Oviatt, 1995]. The rates of speech disfluencies (per 100 words) are shown in Table 2.2. Comparing these to results from previous studies (see Table 2.3) indicates that interactions with REA were more similar to human-human conversation than to human-computer interaction. When asked if he was interacting with a computer or a person, one subject replied "A computer-person I guess. It was a lot like a human." There were no significant differences in utterance length (MLU) across any of the conditions. Strikingly, the behavioural measures indicate that, with respect to speech disfluency rates, talking to REA is more like talking to a person than talking to a computer. Once again, there were significant effects of MEDIUM, SOCIALITY and PERSONALITY on disfluency rate (F=7.09; p<.05), such that disfluency rates were higher in TASK than SOCIAL, higher overall for INTROVERTs than EXTROVERTs, higher for EXTROVERTs on the PHONE, and higher for INTROVERTs in the EMBODIED condition. These effects on disfluency rates are consistent with our conclusion that REA's nonverbal behaviours inadvertently projected an introverted and asocial persona, and with the secondary
hypothesis that the primary driver of disfluency is cognitive load, once the length of the utterance is controlled for [Oviatt, 1995]. Given our results, this hypothesis would indicate that social dialogue imposes a lower cognitive load than task-oriented dialogue, that conversation imposes a higher cognitive load on introverts than on extraverts, that talking on the phone is more demanding than talking face-to-face for extraverts, and that talking face-to-face is more demanding than talking on the phone for introverts, all of which seem reasonable.
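For reference, the rate reported in Table 2.2 is simply a count normalised per 100 words; a minimal sketch (ours, with hypothetical transcript counts) is:

```python
def disfluency_rate(n_disfluencies, n_words):
    """Disfluencies per 100 words, as in Table 2.2."""
    return 100.0 * n_disfluencies / n_words

# Hypothetical subject: 13 coded disfluencies in a 269-word transcript.
print(round(disfluency_rate(13, 269), 2))  # 4.83
```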
7. Conclusion
The complex results of this study give us hope for the future of embodied conversational agents, but also a clear roadmap for future research. In terms of their behaviour with REA, users demonstrated that they treat conversation with her more like human-human conversation than like human-computer conversation. Their verbal disfluencies are the mark of unplanned speech, of a conversational style. However, in terms of their assessment of her abilities, this did not mean that users saw REA through rose-colored glasses. They were clear about the necessity not only to embody the interaction, but to design every aspect of the embodiment in the service of the same interaction. That is, face-to-face conversations with ECAs must demonstrate the same quick timing of nonverbal behaviours as humans (not an easy task, given the state of the technologies we employ). In addition, the persona and nonverbal behaviour of an ECA must be carefully designed to match the task, a conversational style, and user expectations. Relative to other studies on social dialogue and interactions with ECAs, this study has taught us a great deal about how to build ECAs capable of social dialogue and the kinds of applications in which this is important to do. As demonstrated in the study of the August ECA, we found that people will readily conduct social dialogue with an ECA in situations in which there is no time pressure to complete a task or in which it is a normal part of the script for similar human-human interactions, and many people actually prefer this style of interaction. Consequently, we are in the process of developing and evaluating an ECA in the area of coaching for health behaviour change [Bickmore, 2002], an area in which social dialogue – and relationship-building behaviours in general – are known to significantly affect task outcomes. Relative to prior findings in human-human interaction that social dialogue builds trust, our inability to find this effect across all users is likely due to shortcomings in our model, especially in the area of appropriate nonverbal behaviour and lack of uptake of subjects' conversational contributions. Given these shortcomings, similarity attraction effects appear to have overwhelmed the social-dialogue-trust effects, and we observed introverts liking and trusting REA more when she behaved consistently introverted and extraverts liking and
trusting her more when she behaved consistently extraverted. Our conclusion from this is that adding social dialogue to embodied conversational agents will require a model of social nonverbal behaviour consistent with verbal conversational strategies, before the social dialogue fulfils its trust-enhancing role. As computers begin to resemble humans, the bar of user expectations is raised: people expect that REA will hold up her end of the conversation, including dealing with interruptions by active users. We have begun to demonstrate the feasibility of embodied interfaces. Now it is time to design ECAs that people wish to spend time with, and that are able to use their bodies for conversational tasks for which human face-to-face interaction is unparalleled, such as social dialogue, initial business meetings, and negotiation.
Acknowledgements Thanks to Ian Gouldstone, Jennifer Smith and Elisabeth Sylvan for help in conducting the experiment and analyzing data, and to the rest of the Gesture and Narrative Language Group for their help and support.
References Altman, I. and Taylor, D. (1973). Social Penetration: The Development of Interpersonal Relationships. New York: Holt, Rinehart & Winston. Andersen, P. and Guerrero, L. (1998). The Bright Side of Relational Communication: Interpersonal Warmth as a Social Emotion. In Andersen, P. and Guerrero, L., editors, Handbook of Communication and Emotion, pages 303–329. New York: Academic Press. André, E., Muller, J., and Rist, T. (1996). The PPP Persona: A Multipurpose Animated Presentation Agent. In Proceedings of Advanced Visual Interfaces, pages 245–247, Gubbio, Italy. André, E., Rist, T., and Muller, J. (1998). Integrating Reactive and Scripted Behaviors in a Life-Like Presentation Agent. In Proceedings of the Second International Conference on Autonomous Agents, pages 261–268, Minneapolis, Minnesota, USA. Argyle, M. (1988). Bodily Communication. New York: Methuen & Co. Ltd. Argyle, M. (1990). The Biological Basis of Rapport. Psychological Inquiry, 1:297–300. Beskow, J. and McGlashan, S. (1997). Olga: A Conversational Agent with Gestures. In André, E., editor, Proceedings of the IJCAI 1997 Workshop on Animated Interface Agents: Making Them Intelligent, Nagoya, Japan. San Francisco: Morgan-Kaufmann Publishers. Bickmore, T. (2002). When Etiquette Really Matters: Relational Agents and Behavior Change. In Proceedings of AAAI Fall Symposium on Etiquette for Human-Computer Work, pages 9–10, Falmouth, MA.
Bickmore, T. and Cassell, J. (2001). Relational Agents: A Model and Implementation of Building User Trust. In Proceedings of CHI 2001, pages 396–403, Seattle, WA. Bos, N., Olson, J. S., Gergle, D., Olson, G. M., and Wright, Z. (2002). Effects of Four Computer-Mediated Communications Channels on Trust Development. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 135–140, Minneapolis, Minnesota, USA. Boyle, E., Anderson, A., and Newlands, A. (1994). The Effects of Visibility in a Cooperative Problem Solving Task. Language and Speech, 37(1):1–20. Brown, P. and Levinson, S. (1978). Universals in Language Usage: Politeness Phenomena. In Goody, E., editor, Questions and Politeness: Strategies in Social Interaction, pages 56–289. Cambridge: Cambridge University Press. Brown, R. and Gilman, A. (1972). The Pronouns of Power and Solidarity. In Giglioli, P., editor, Language and Social Context, pages 252–282. Harmondsworth: Penguin. Brown, R. and Gilman, A. (1989). Politeness Theory and Shakespeare's Four Major Tragedies. Language in Society, 18:159–212. Cassell, J. and Bickmore, T. (2003). Negotiated Collusion: Modeling Social Language and its Relationship Effects in Intelligent Agents. User Modeling and User-Adapted Interaction, 13(1-2):89–132. Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhjálmsson, H., and Yan, H. (1999). Embodiment in Conversational Interfaces: Rea. In Proceedings of ACM SIGCHI Conference on Human Factors in Computing Systems, pages 520–527, Pittsburgh, PA. Cassell, J., Bickmore, T., Campbell, L., Vilhjálmsson, H., and Yan, H. (2000a). Human Conversation as a System Framework: Designing Embodied Conversational Agents. In Embodied Conversational Agents, pages 29–63. Cambridge, MA: MIT Press. Cassell, J., Bickmore, T., Vilhjálmsson, H., and Yan, H. (2000b). More Than Just a Pretty Face: Affordances of Embodiment. In Proceedings of the 5th International Conference on Intelligent User Interfaces, pages 52–59, New Orleans, Louisiana. Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. (2000c). Embodied Conversational Agents. Cambridge, MA: MIT Press. Cheepen, C. (1988). The Predictability of Informal Conversation. New York: Pinter. Chovil, N. (1991). Discourse-Oriented Facial Displays in Conversation. Research on Language and Social Interaction, 25(1991/1992):163–194. Clark, H. H. (1996). Using Language. Cambridge: Cambridge University Press. Coupland, J., Coupland, N., and Robinson, J. D. (1992). How Are You? Negotiating Phatic Communion. Language in Society, 21:207–230.
Daly-Jones, O., Monk, A. F., and Watts, L. A. (1998). Some Advantages of Video Conferencing over High-Quality Audio Conferencing: Fluency and Awareness of Attentional Focus. International Journal of Human-Computer Studies, 49(1):21–58. Dehn, D. M. and van Mulken, S. (2000). The Impact of Animated Interface Agents: A Review of Empirical Research. International Journal of Human-Computer Studies, 52:1–22. Dunbar, R. (1996). Grooming, Gossip, and the Evolution of Language. Cambridge, MA: Harvard University Press. Duncan, S. (1974). On the Structure of Speaker-Auditor Interaction during Speaking Turns. Language in Society, 3:161–180. Garros, D. (1999). Real Estate Agent, Home and Hearth Realty. Personal communication. Goffman, E. (1967). On Face-Work. In Interaction Ritual: Essays on Face-to-Face Behavior, pages 5–46. New York: Pantheon. Gustafson, J., Lindberg, N., and Lundeberg, M. (1999). The August Spoken Dialogue System. In Proceedings of the Eurospeech 1999 Conference, pages 1151–1154, Budapest, Hungary. Hauptmann, A. G. and Rudnicky, A. I. (1988). Talking to Computers: An Empirical Investigation. International Journal of Man-Machine Studies, 8(6): 583–604. Jakobson, R. (1960). Concluding Statement: Linguistics and Poetics. In Sebeok, T., editor, Style in Language, pages 351–377. New York: Wiley. Kendon, A. (1980). Conducting Interaction: Patterns of Behavior in Focused Encounters, volume 7. Cambridge: Cambridge University Press. Kiesler, S. and Sproull, L. (1997). Social Human-Computer Interaction. In Friedman, B., editor, Human Values and the Design of Computer Technology, pages 191–199. Stanford, CA: CSLI Publications. Koda, T. and Maes, P. (1996). Agents with Faces: The Effect of Personification. In Proceedings of the Fifth IEEE International Workshop on Robot and Human Communication (RO-MAN 1996), pages 189–194, Tsukuba, Japan. Laver, J. (1975). Communicative Functions of Phatic Communion. In Kendon, A., Harris, R., and Key, M., editors, The Organization of Behavior in Face-to-Face Interaction, pages 215–238. The Hague: Mouton. Laver, J. (1981). Linguistic Routines and Politeness in Greeting and Parting. In Coulmas, F., editor, Conversational Routine, pages 289–304. The Hague: Mouton. Lester, J., Stone, B., and Stelling, G. (1999). Lifelike Pedagogical Agents for Mixed-Initiative Problem Solving in Constructivist Learning Environments. User Modeling and User-Adapted Interaction, 9(1-2):1–44. Maes, P. (1989). How to Do the Right Thing. Connection Science Journal, 1(3):291–323.
Malinowski, B. (1923). The Problem of Meaning in Primitive Languages. In Ogden, C. K. and Richards, I. A., editors, The Meaning of Meaning, pages 296–346. Routledge & Kegan Paul. McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. Cambridge: Cambridge University Press. Moon, Y. and Nass, C. I. (1996). How "Real" are Computer Personalities? Psychological Responses to Personality Types in Human-Computer Interaction. Communication Research, 23(6):651–674. Nass, C. and Gong, L. (2000). Speech Interfaces from an Evolutionary Perspective. Communications of the ACM, 43(9):36–43. Onkvisit, S. and Shaw, J. J. (1994). Consumer Behavior: Strategy and Analysis. New York: Macmillan College Publishing Company. Oviatt, S. (1995). Predicting Spoken Disfluencies during Human-Computer Interaction. Computer Speech and Language, 9:19–35. Oviatt, S. and Adams, B. (2000). Designing and Evaluating Conversational Interfaces with Animated Characters. In Cassell, J., Sullivan, J., Prevost, S., and Churchill, E., editors, Embodied Conversational Agents, pages 319–345. Cambridge, MA: MIT Press. Oviatt, S. L. (1998). User-Centered Modeling for Spoken Language and Multimodal Interfaces. In Maybury, M. T. and Wahlster, W., editors, Readings in Intelligent User Interfaces, pages 620–630. San Francisco, CA: Morgan Kaufmann Publishers, Inc. Petty, R. and Wegener, D. (1998). Attitude Change: Multiple Roles for Persuasion Variables. In Gilbert, D., Fiske, S., and Lindzey, G., editors, The Handbook of Social Psychology, pages 323–390. New York: McGraw-Hill. Prus, R. (1989). Making Sales: Influence as Interpersonal Accomplishment. Newbury Park, CA: Sage. Reeves, B. and Nass, C. (1996). The Media Equation. Cambridge: Cambridge University Press. Resnick, P. V. and Lammers, H. B. (1985). The Influence of Self-Esteem on Cognitive Responses to Machine-Like versus Human-Like Computer Feedback. The Journal of Social Psychology, 125(6):761–769. Richmond, V. and McCroskey, J. (1995). Immediacy - Nonverbal Behavior in Interpersonal Relations. Boston: Allyn & Bacon. Rickel, J. and Johnson, W. L. (1998). Task-Oriented Dialogs with Animated Agents in Virtual Reality. In Proceedings of the First Workshop on Embodied Conversational Characters, pages 39–46. Rickenberg, R. and Reeves, B. (2000). The Effects of Animated Characters on Anxiety, Task Performance, and Evaluations of User Interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 49–56, The Hague, The Netherlands.
Rutter, D. R. (1987). Communicating by Telephone. New York: Pergamon Press. Schneider, K. P. (1988). Small Talk: Analysing Phatic Discourse. Marburg: Hitzeroth. Sidner, C. and Dzikovska, M. (2005). A First Experiment in Engagement for Human-Robot Interaction in Hosting Activities. In van Kuppevelt, J., Dybkjær, L., and Bernsen, N. O., editors, Advances in Natural Multimodal Dialogue Systems. Springer. This volume. Sproull, L., Subramani, M., Kiesler, S., Walker, J., and Waters, K. (1997). When the Interface is a Face. In Friedman, B., editor, Human Values and the Design of Computer Technology, pages 163–190. Stanford, CA: CSLI Publications. Stone, M. and Doran, C. (1997). Sentence Planning as Description Using TreeAdjoining Grammar. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL/EACL 1997), pages 198–205, Madrid, Spain. Svennevig, J. (1999). Getting Acquainted in Conversation. Philadelphia: John Benjamins. Takeuchi, A. and Naito, T. (1995). Situated Facial Displays: Towards Social Interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 450–455, Denver, Colorado. van Mulken, S., André, E., and Muller, J. (1999). An Empirical Study on the Trustworthiness of Lifelike Interface Agents. In Bullinger, H.-J. and Ziegler, J., editors, Human-Computer Interaction (Proceedings of HCI-International 1999), pages 152–156. Mahwah, NJ: Lawrence Erlbaum Associates. Wheeless, L. and Grotz, J. (1977). The Measurement of Trust and Its Relationship to Self-Disclosure. Human Communication Research, 3(3):250–257. Whittaker, S. and O’Conaill, B. (1993). An Evaluation of Video Mediated Communication. In Proceedings of Human Factors and Computing Systems (INTERACT/CHI 1993), pages 73–74, Amsterdam, The Netherlands. Whittaker, S. and O’Conaill, B. (1997). The Role of Vision in Face-to-Face and Mediated Communication. In Finn, K., Sellen, A., and Wilbur, S., editors, Video-Mediated Communication, pages 23–49. Lawrence Erlbaum Associates, Inc. Wiggins, J. (1979). A Psychological Taxonomy of Trait-Descriptive Terms. Journal of Personality and Social Psychology, 37(3):395–412. Wilson, G. (1977). Introversion/Extraversion. In Blass, T., editor, Personality Variables in Social Behavior, pages 179–218. New York: John Wiley & Sons.
Zheng, J., Veinott, E. S., Bos, N., Olson, J. S., and Olson, G. M. (2002). Trust without Touch: Jumpstarting Long-Distance Trust with Initial Social Activities. In Proceedings of the International Conference for Human-Computer Interaction (CHI), pages 141–146.
Chapter 3
A FIRST EXPERIMENT IN ENGAGEMENT FOR HUMAN-ROBOT INTERACTION IN HOSTING ACTIVITIES∗
Candace L. Sidner
Mitsubishi Electric Research Labs, 201 Broadway, Cambridge, MA 02139, USA
[email protected]
Myroslava Dzikovska
Department of Computer Science, University of Rochester, Rochester, NY 14627, USA
[email protected]
Abstract
To participate in collaborations with people, robots must not only see and talk with people but also make use of the conventions of conversation and of the means to be connected to their human counterparts. This chapter reports on initial research on engagement in human-human interaction and applications to stationary robots interacting with humans in hosting activities.
Keywords: Human-robot interaction, engagement, gestures, looking, conversation, collaboration, hosting.
1. Introduction
As part of our ongoing research on collaborative interface agents, we have begun to explore engagement in human interaction. Engagement is the process
by which two, or more, participants establish, maintain and end their perceived connection during interactions they jointly undertake. This process includes: establishment of the initial contact with another participant, negotiation of a collaboration to undertake activities of mutual interest, determination of the ongoing intent of the other participant to continue in the interaction, evaluation of one's own intentions in staying involved, and determination of when to end the interaction. To understand the engagement process we are studying human-to-human engagement in interactions. Our study provides an understanding of the capabilities required for human-robot interaction. At the same time, experimentation with human-robot interaction provides a valid means to test theories about engagement as well as to produce useful technological results. In this chapter we report on our initial experiments in programming a stationary robot to have initial engagement abilities.

∗ Portions of this chapter are reprinted with permission from C. Sidner and M. Dzikovska, "Human-Robot Interaction: Engagement between Humans and Robots for Hosting Activities," The Fourth IEEE International Conference on Multi-modal Interfaces, October 2002, pages 123-128. © 2002 IEEE.
2. Hosting Activities
This research on engagement is framed around the activity of hosting. Hosting activities are a class of collaborative activity in which an agent provides guidance in the form of information, entertainment, education or other services in the user's environment (which may be an artificial or the natural world) and may also request that the user undertake actions to support the fulfilment of those services. Hosting activities are situated or embedded activities, because they depend on the surrounding environment as well as the participants involved. They are social activities because, when undertaken by humans, they depend upon the social roles that people play to determine the choice of the next actions, timing of those actions, and negotiation about the choice of actions. In this research, the hosts of the environment are agents serving as guides, either on-screen animated ones or physical robots. Tutoring applications require hosting activities; this chapter reports on experience with a robot host who is acting as a tutor. Some portions of tutoring, such as testing a student's knowledge, troubleshooting concepts that a student fails to grasp, and keeping track of what a student knows, go beyond the informational services of hosting activities. Thus hosting is part of tutoring but not vice versa. Another common hosting activity is hosting a user in a room with a collection of artefacts. In such an environment, the ability of the host to interact with the physical world and visitors becomes essential, and justifies the creation of physical agents. Room hosting is a core activity in tour guiding in museums, and other indoor and outdoor spaces; see [Burgard et al., 1998] for a robot that can guide a museum tour. Sales activities include hosting as part of their mission in order to make customers aware of types of products and features, locations, personnel, and the like. In many activities, hosting may be
intermingled with other tasks, e.g. with selling items in retail sales or student performance evaluation tasks in tutoring. Hosting activities are collaborative because neither party completely determines the goals to be undertaken nor the means of reaching the goal; these actions must be shared between the parties. While the visitor’s interests in the room may seem paramount in determining shared goals, the host’s (private) knowledge of the environment also constrains the goals that can be achieved. Typically the goals undertaken will need to be negotiated between visitor and host. Even in tutoring, where the tutor-host’s plans for how to tutor the student may seem to drive the interaction, the tutor and student negotiate on the problems they will undertake in their encounter. This work hypothesizes that by creating computer agents which function more like human hosts, the human participants will focus on the hosting activity and be less distracted by the agent interface. For example, the agent will gaze at the human partner and at domain objects in ways that appropriately indicate the agent’s attention to each. For example, if a human agent directs her gaze at a partner rather than an object, one can infer that the human is interested in her partner. When a human partner gazes away from the robot or objects of discussion, the robot must be able to assess whether the human has lost interest, and if so, determine how to re-establish or end the engagement between the two participants. Assuring that a robot behaves in ways with which people are familiar in human-to-human interactions increases the likelihood that the interaction will not break down due to the robot’s misusing or misunderstanding the cues of engagement.
3. What is Engagement?
Engagement is fundamentally a collaborative process [Grosz and Sidner, 1990; Grosz and Kraus, 1996], although it also requires significant private planning on the part of each participant in the engagement. Engagement is collaborative principally because the people who are interacting intend to connect together and cannot do so alone. However, they may be less aware of the actions involved in accomplishing the joint engagement goals, e.g. gaze, head and hand gestures, unlike the conscious actions in other types of collaboration. Engagement, like other types of collaborations, consists of establishing the collaborative goal (the goal to be connected), maintaining the connection, and then ending the engagement. The collaboration process may include negotiation of the goal because a potential collaborator might not decide to become engaged right away or at all. In addition, participants might have to negotiate the means to achieve their goals [Sidner, 1994a; Sidner, 1994b]. For engagement in an interaction, participants negotiate the means for achieving engagement through the various ways they maintain engagement, and repair
engagement when it appears to be failing. Described this way, engagement is similar to other collaborative activities. Engagement is an activity that contributes centrally to collaboration on other activities in the world and the conversations that support them. In fact, conversation is impossible without engagement. This claim does not imply that engagement is just a part of conversation. Rather engagement is a collaborative process that occurs in its own right, simply to establish connection between people, a natural social phenomenon of human existence. It is entirely possible to engage another without a single word being said and to maintain the engagement process with no conversation. That is not to say that engagement is possible without any communication; it is not. A person who engages another without language must rely effectively on some form of gestural communication to establish the engagement joint goal and to maintain the engagement. Gesture is also a significant feature of face-to-face interaction when conversations take place [McNeill, 1992]. Being engaged with another can also be the sole purpose of an interaction. The use of just a few words and gestures can establish and maintain connection with another when no other intended goals are relevant. For example, an exchange of hellos, a brief exchange of eye contact and a set of good-byes can accomplish an interaction just to be engaged. In such interactions, one can reasonably claim that the only purpose is to be connected. The current work focuses on interactions, ones that include conversations, where the participants wish to accomplish action in the world rather than just the relational connection that engagement can provide. Much of the engagement process can be accomplished by linguistic means only. Evidence for this statement derives from telephone conversations where participants are engaged with each other and have only the words they say, and prosodic effects (pitch, timing, duration, voice quality and the like) to indicate their desire for establishing, continuing and ending their connection to each other. However, in face-to-face interaction, people look at one another as they talk, and make use of gestures to indicate their interest in the other’s communication, and to indicate that they wish to continue while at the same time using other gestures to access other information in the environment. The engagement process must always balance the need to convey ongoing engagement with the conversational partner (or signal its demise) with the need to look at objects in the environment and perform actions called for by the collaboration (in addition to ones that are independent of it), as well as to interpret those gestures from the partner who has the same requirement to balance these needs.
4. First Experiment in Hosting: A Pointing Robot
In order to experiment with engagement in hosting activities, this effort began with a well-delimited problem: appropriate pointing and beat gestures for a stationary robot, called Mel, while conducting a conversation. Mel’s behaviour is a direct product of extensive research on animated pedagogical agents [Johnson et al., 2000]. It shares with those agents concerns about conversational signals and pointing. Unlike those efforts, Mel has greater dialogue capability, and its conversational signalling, including deixis, comes from combining the CollagenTM [Rich et al., 2001; Rich and Sidner, 1998] and Rea architectures [Cassell et al., 2001a]. Furthermore, while on-screen embodied agents [Cassell et al., 2000b] can point to things in an on-screen environment, on-screen agents cannot effectively point in a 3D space. So it seemed appropriate to explore the effects of deictic behaviour with a robot. To build a robot host, the effort relied significantly on the PACO agent [Rickel et al., 2002] built using CollagenTM for tutoring a user on the operation of a gas turbine engine. The PACO agent tutors a student on the procedures needed to control two engines by their various buttons and dials. The robot served as the tutor in this application and took on the task of speaking all the output and pointing to the portions of the display, tasks normally done by an on-screen agent in the PACO system. The student’s operation of the display, through a combination of speech input and mouse clicks, remained unchanged. Understanding of the student’s speech was accomplished with the IBM ViaVoiceTM speech recognizer, the IBM JSAPI1 to parse and interpret utterances, and the CollagenTM middleware to provide dialogue interpretation and next moves in the conversation, to manage the tutoring goals and to provide a student model for tutoring. The PACO screen for gas turbine engine tutoring is shown in Figure 3.1. The agent is represented in a small window, where text, a cursor hand and an iconic face appear. The face changes to indicate six states: the agent is speaking, is listening to the user, is waiting for the user to reply, is thinking, is acting on the interface, and has failed due to a system crash. The cursor hand is used to point out objects in the display. The robotic agent, Mel, is a stationary robot created at Mitsubishi Electric Research Labs, and consists of 5 servomotors to control the movement of the robot’s head, mouth and two appendages. The robot takes the appearance of a penguin. Mel can open and close his beak, move his head in up-down, and left-right combinations, and flap his "wings" up and down. He also has a laser light on his beak, and a speaker provides audio output for him. See Figure 3.2 for Mel pointing to a button on the gas turbine control panel.
1 See the ViaVoice SDK, at http://www4.ibm.com/software/speech/dev/sdk java.html.
Figure 3.1. The PACO agent for gas turbine engine tutoring.
For gas turbine tutoring, Mel sits in front of a large (2 feet x 3 feet) horizontal flat-screen display on which the gas turbine display panel is projected. To conduct a conversation with the student, Mel addresses the student face-on, and beats with his wings at appropriate points in his turn in the conversation. He uses the PACO system to teach the student procedures on the display. When he wishes to point to a button or dial on the display panel, he points with his beak. When he finishes pointing, he addresses the student face-on again. While Mel's motor operations are extremely limited, they offer enough movement to undertake beat gestures, which indicate new and old information in utterances [Cassell et al., 2001b]. The head movement is also sufficient to point effectively at objects, so that students can readily see the objects on the panel. The architecture of a CollagenTM agent and an application using Mel is shown in Figure 3.3. Specifics of the CollagenTM internal organization and the means by which CollagenTM is connected to applications are beyond the scope of this chapter; see [Rich and Sidner, 1998; Rich et al., 2001] for more information. Basically, the application is connected to the CollagenTM system through the application adapter. The adapter translates between the semantic events CollagenTM understands and the events/function calls understood by the application.
Figure 3.2. Mel pointing to the gas turbine control panel.
The agent controls the application by sending events to perform to the application, and the adapter sends performed events to CollagenTM when a user performs actions on the application. CollagenTM is notified of the propositions uttered by the agent via uttered events. They also go to the AgentHome window, which is a graphical component responsible in CollagenTM for showing the agent's words on screen as well as generating speech in a speech-enabled system. The shaded area highlights the components that were added to the standard CollagenTM middleware. With these additions, utterance events go through the Mel annotator and the BEAT system [Cassell et al., 2001b] in order to generate gestures as well as the utterances that CollagenTM already produces. More details on the architecture and Mel's function with it can be found in [Sidner and Dzikovska, 2002]. In tutoring, the CollagenTM architecture is instantiated by means of a detailed set of recipes for the tutoring domain that must be specified in the Planning and Discourse module. Recipes are the means by which the hosting environment is specified for the robot. The recipes do not specify dialogue actions, but instead must detail the actions needed to operate the gas turbine panel and the actions needed to tutor students to use the panel. Additional rules in the
Agent help the robot tutor decide what to say or do next, by choosing from a list of next moves created by the Planning and Discourse module. The recipes and rules apply to the domain that is being tutored only, and do not affect the engagement mechanisms that determine the robot’s wing and head gestures. The engagement mechanisms are usable in any tutoring activity. However, the tutoring domain recipes do provide a piece of information for the engagement mechanisms, namely what items in the display need to be pointed out. In this way, the current architecture separates linguistic and gesture functions. Thus, like a person, the robot could convey engagement with its linguistic behaviour but also convey the desire to disengage with its gestures.
Figure 3.3. Architecture of Mel.
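To make the event flow in Figure 3.3 easier to follow, the sketch below restates the adapter and annotation roles in schematic Python. All class and method names here are hypothetical; this is not the CollagenTM, BEAT or Mel code, only an outline of the pattern described above: the agent performs semantic events on the application through the adapter, user actions are reported back as performed events, and utterance events are annotated with gestures before output.

```python
class ApplicationAdapter:
    """Translates between semantic events and application calls (illustrative only)."""

    def __init__(self, application, dialogue_manager):
        self.app = application
        self.dm = dialogue_manager

    def perform(self, event):
        # Agent -> application: e.g. press a button on the gas turbine panel.
        self.app.execute(event.name, **event.args)

    def on_user_action(self, action):
        # Application -> dialogue manager: report what the user just did.
        self.dm.notify_performed(action)


class GestureAnnotator:
    """Adds pointing and beat gestures to utterance events before output."""

    def __init__(self, gesture_planner, robot):
        self.planner = gesture_planner    # stands in for a BEAT-like component
        self.robot = robot

    def on_utterance(self, utterance, referent=None):
        plan = self.planner.annotate(utterance, referent)
        if referent is not None:
            self.robot.point_at(referent)  # beak points at the panel object
        self.robot.say(plan.text)
        self.robot.look_at_user()          # return gaze after pointing
```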
5. Making Progress on Hosting Behaviours
Mel is quite effective at pointing in a display so that it can be readily followed by people. Mel’s beak is a large enough pointer to operate in the way
that a finger does. Pointing within a very small margin of error (which is assured by careful calibration before Mel begins talking) makes it possible to locate the appropriate buttons and dials on the screen. Mel returns his "gaze" (his eyes do not move, so to look at the person the whole head must turn) to the student after pointing, which is a signal that he is staying engaged with the user. These two behaviours are a first step in creating engagement. They make it evident to the student that Mel is interacting with the student and that his looks away are not intended to disengage but rather to accomplish a part of the task at hand. However, human engagement is far richer than what this robot can currently do [Sidner et al., 2003]. Most significantly, engagement is a two-part activity: engaging behaviours must be produced, and at the same time, the engaging agent must interpret engaging behaviours from its conversational collaborator. Two of the most basic aspects of engagement are beginning and ending it. The means by which one begins and ends an interaction with Mel are minimal. While Mel responds to a student greeting to start the conversation, he does not have any means or goals to decide when and how to begin the conversation itself. Mel also does not know how to end a conversation or when it is appropriate to do so. Furthermore, Mel has only two weak ways of checking on his partner's signals of engagement during their interaction: to ask "okay?" and await a response from the user after every explanation he offers, and to await (including indefinitely) a user response (utterance or action) after each time he instructs the user to act. In human-to-human interactions, engagement activities range over far more linguistic and gestural behaviours. In more recent system-building efforts [Sidner et al., 2005], Mel produces a much wider range of behaviours: he looks at and interprets some looking acts from the human participant, gestures at objects, and begins and ends conversations. All these capabilities result from more sophisticated subsystems for Mel, such as vision algorithms for detecting human faces and sound location algorithms, as well as careful study of human-human scenarios and video data [Sidner et al., 2003] for determining the types of engagement strategies that humans use effectively in hosting situations. To understand more about the variety of behaviours that signal engagement, in the next section properties of engagement in human-human interaction are discussed in detail, without regard to what robots might do or how.
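The two behaviours Mel does exhibit, pointing and then returning his gaze while awaiting a response, can be summarised in a deliberately simple sketch. This is our illustration with hypothetical method names, not Mel's actual control code.

```python
import time

def tutor_step(robot, utterance, target=None):
    """Speak one tutoring step, pointing first if a panel object is involved."""
    if target is not None:
        robot.point_beak_at(target)   # calibrated pointing at a button or dial
    robot.say(utterance)
    robot.face_user()                 # returning the head signals continued engagement
    while True:                       # Mel simply waits, possibly indefinitely,
        response = robot.poll_user()  # for speech or a mouse action on the panel
        if response is not None:
            return response
        time.sleep(0.1)
```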
6. Engagement for Human-Human Interaction
Engagement concerns the establishment of connection, the maintenance of it and the ending of the engagement when one or more of the participants in
an interaction desire it. First let us consider the problem of establishing a connection. A typical beginning of an interaction is familiar to everyone. One approaches another person with whom one would like to interact, looks towards them, attempts to catch their eye. Thereafter additional actions take place, called opening ritual greetings [Luger, 1983]. The two people smile and offer greetings. Verbal greetings can be accompanied by physical behaviours, such as handshakes, hugs or bowing (which is common in Asian cultures). The greetee responds in kind coordination with the greeter. However, this beginning is not the only possible pattern. The greetee could have responded with a request for delay after part of the process (as in "oh hi, give me a minute," and then attending to another matter). Such a response puts the start of the interaction on hold, and the greeter has the option not to continue, just as the greetee does. A third option is illustrated in the scenario presented in Figure 3.4, part 1. This scenario, constructed for purposes of illustrating engagement, presents an interaction between a host and a visitor in a laboratory hosting setting. It illustrates a less than successful attempt at engagement. The visitor does not immediately engage with the host, who uses a greeting and an offer to provide a tour as means of (1) engaging the visitor and (2) proposing a joint activity in the hosting world. Neither the engagement nor the joint activity is accepted by the visitor. The visitor accomplishes this non-acceptance by ignoring the uptake of the engagement activity, which also quashes the tour offer. While there are many variations on the examples above, the point of the initial part of the interaction is to engage, that is, establish connection between the participants. It is a collaborative process, where each offers signals of engagement, which are used to begin the rest of the greeting process. It is collaborative because each person must do their part and intend their part to establish the connection. Failure to intend to participate denies the collaboration and failure to do one's part causes the greeting to fail, although with failed greetings, the participants may try again. Once an interaction is established, there are a large number of indications of engagement. The list below is illustrative of ways in which engagement is signalled either directly or in concert with some other activity in the interaction. Note that not all of these signals are essential to interaction because people are capable of undertaking instructions and having conversations by phone and asynchronous connections such as email. However, without these signals, the collaboration is a slow enterprise and likely to break down because some actions may be interpreted as intending to disengage. Understanding what each signal conveys about engagement allows us to better understand human communication as well as build artificial agents that engage properly. The signals include:
Part 1
Host: Hello, I'm the room host. Would you like me to show you around?
Visitor:

Part 2
Visitor: What is this?
Host: That's a camera that allows a computer to see as well as a person to track people as they move around a room.
Visitor: What does it see?
Host: Come over here and look at this monitor <points>. It will show you what the camera is seeing and what it identifies at each moment.

Part 3
Visitor: Uh-huh. What are the boxes around the heads?
Host: The program identifies the most interesting things in the room – faces. That shows it is finding a face.
Visitor: Oh, I see. Well, what else is there?

Part 4
Host: Let's take a look at the multi-level screen over there <points>.
Visitor:
Host: Is there something else you want to see?
Visitor: No, I think I've seen enough. Bye.
Host: Ok. Bye.

Figure 3.4. Scenario for Room Hosting.
talking about and performing the task;
turn taking and grounding [Clark, 1996];
timing (i.e. the pace of uptake of a turn);
use of gaze at the speaker, gaze away for taking turns [Duncan, 1974; Cassell et al., 2000a], to track the speaker's gestures with objects;
use of gaze by speaker or non-speaker to check on the attention of the other;
hand gestures for pointing, iconic description, beat gestures, etc., see [Cassell, 2000; Johnson et al., 2000], and in the hosting setting, gestures associated with domain objects;
head gestures (e.g. nods, shakes, sideways turns);
body stance (i.e. facing towards the other, turning away, standing up when previously sitting and sitting down);
facial gestures (not explored in this work but see [Pelachaud et al., 1996]);
non-linguistic auditory responses (e.g. snorts, laughs);
social relational activities (e.g. telling jokes, role playing, supportive rejoinders).

Among these signals, the use of gaze and looks is one of the most direct means of conveying engagement or lack of it. The principle of conversational tracking governs looking in interactions: a participant in a collaborative conversation tracks the other participant's face during the conversation in balance with the requirement to look away in order to: (1) participate in actions relevant to the collaboration, or (2) multi-task with activities unrelated to the current collaboration, such as scanning the surrounding environment for interest or danger, avoiding collisions, or performing personal activities [Sidner et al., 2005]. The robot reported here is successful in certain simple looking behaviours. It turns its head to point at objects relevant to the collaboration and then turns again to look at the human tutor for its own turn as well as the human's. However, the robot cannot perform in the inverse capacity. It does not recognize the looks or gazes of its human partner and hence cannot tell whether the human is looking at objects relevant to the collaboration or looking around out of lack of interest and intentions to disengage. By contrast, the host in the scenario in Figure 3.4, part 4 notices when the visitor fails to look at the relevant objects for the interaction and the host takes this as evidence of disengagement, which the host pursues. Taking part in the conversation and the collaboration which a conversation serves is clear evidence of engagement. Failing to perform the actions of a collaboration, however, is evidence of the intention to disengage. In the scenario in Figure 3.4, the visitor follows the host to the appropriate place in part 3. Failure to do so would indicate the intention to disengage from the collaborative task (of answering the visitor's question about what the camera sees). When the visitor fails to follow the host's pointing in part 4, this failure to uptake a proposed action that is necessary to their collaborative task is also a sign of disengagement. Of course, when the partner does not take up the task, but has some argument for something else to do, the partner is still engaged. The timing of a conversational turn, taking the turn, and offering backchannels are behaviours that are part of acknowledging the contribution of a conversational partner. During a conversation, the participant acting as the hearer
for the conversation grounds the contributions of the other participant by indicating that he or she has heard and understood what was said [Clark, 1996]. Taking a turn makes it possible to use the first part of the turn to do so, and uptake of the turn in pace with the other speaker indicates that all is going smoothly. Both taking a turn and maintaining the pace of the turn are evidence of engagement. When a participant does not take a turn or delays the turn while attending to something outside the collaboration, that participant conveys that they no longer are attending to the interaction. Loss of attention is the most basic indicator that a participant both intends to disengage and has begun to do so. Backchannels are the means by which the hearer grounds the speaker’s turn when a full turn is not taken. Backchannels (using such phrases as “mmhm, yes, uh-huh” during pauses in the other participant’s speaking) permit the hearer to indicate they are following what is said. Most backchannels occur following a speaker utterance which uses prosody and duration to indicate that the backchannel is expected. Failure to provide this expected result is also indication that disengagement is intended. Body stance is a means of expressing the primary focus of one’s attention. Attending to another makes it possible to employ the principle of conversational tracking. Facing one’s conversational partner with one’s physical stance thus supports the ability to track the partner. However, many physical tasks that are part of a collaboration may preclude a frontal stance to the other participant. When washing the dishes, one must face the sink and dishes. When looking under the hood of a car for evidence of a problem, both participants must face the car and lean in under the hood. Using spoken linguistic utterances and more occasional looking indicates one’s ongoing engagement. Just how participants decide to balance looking in these types of circumstances is not well understood. Social cues can serve as signals of engagement. Social relational activities that maintain or establish social relationships among the dialogue participants are evidence of engagement. Social activities such as telling jokes, supportive rejoinders to what another has said, role playing, as well as auditory responses (such as snorts of approval/disapproval and laughing) may serve to accomplish a collaboration task by indicating some trouble with the task. However, they also reinforce social relationships among the participants. Bickmore [2003] notes that social dialogue (even without the performance of accompanying physical actions) increases the trust between dialogue participants, a claim also supported in [Bickmore and Cassell, 2001; Katagiri et al., 2001]. Through videotaped sessions of hosts and visitors, we have observed aspects of social relationships. The hosts and visitors in the videotaped sessions tell humorous stories, offer rejoinders or replies that go beyond conveying that the informa-
information just offered was understood, and even take on role playing with the host and the objects being exhibited. Figure 3.5 contains a portion of a transcript of one hosting session. In that session, the visitor and the host spontaneously play the part of two children using the special restaurant table that the host was demonstrating. The reader should note that their play is coordinated and interactive and is not discussed before it occurs. The role-playing begins at 10 in the figure and ends at 17. This segment of the transcript is preceded by the host P having shown the visitor C how restaurant customers order food in an imaginary restaurant using an actual electronic table, and having explained how wait staff might use the new electronic table to assist customers. Note that utterances by P and C are labelled with their letter, a colon, and italics, while other material describes their body actions.

Social relational activities such as the role playing in Figure 3.5 allow participants to demonstrate that they are socially connected to one another. The participants do more than just look at each other and nod to one another in order to accomplish their domain goals; they actively seek ways to indicate to the other that they have some social relation to each other. Failure of a participant to take part in social activities indicates trouble in the interaction; the participant may simply wish to change the relationship or to indicate how she or he feels about it. However, the failure to participate may also indicate that the participant intends to disengage from the interaction.

All collaborations have an end, either because the participants give up on the goal, cf. [Cohen and Levesque, 1990], or because the collaboration succeeds in achieving the desired goals. When collaboration on a domain task ends, participants can elect to negotiate an additional task collaboration or refrain from doing so. When they refrain, they then undertake to terminate their collaboration and end the engagement. Their means to do so are presumably as varied as the rituals to begin engagement, but common patterns comprise pre-closing, expressing appreciation, saying goodbye (with an optional handshake or other gesture), and then moving away from one another. Pre-closings [Schegloff and Sacks, 1973] convey that the end is coming. Expressing appreciation is part of a socially determined custom in the US (and many other cultures) when someone has performed a service for an individual. In the hosting data, the visitor expresses appreciation, with acknowledgement of the host. Where the host has had some role in persuading the visitor to participate, the host may express appreciation as part of the pre-closing. Moving away is a strong cue that the disengagement has taken place.

However, disengagement can be less smooth. In Figure 3.4, part 4, the host's offer in the hosting activities is not accepted, not by a verbal denial, but by lack of response, an indication of disengagement. The host, who could have chosen to re-state his offer (with some persuasive comments), instead takes a simpler negotiation tack and asks what the visitor would like to see.
54: P turns head/eyes to C, raises hands up. C's head down, eyes on table.
55: P moves away from C and table, raises hands and shakes them; moves totally away, fully upright.
56: P: Uh and show you how the system all works C looks at P and nods.
58: P sits down. P: ah
00: P: ah another aspect that we're P rotates each hand in coordination. C looks at P.
01: P: worried about P shakes hands.
02: P: you know C nods.
04: P: sort of a you know this would fit very nicely in a sort of theme restaurant P looks at C; looks down.
05: C: MM-hm C looks at P, nods at "MM-hm." P: where you have lots of
06: P draws hands back to chest while looking at C. C: MM-hm P: kids C nods, looking at P.
07: P: I have kids. If you brought them to a P has hands out and open, looks down then at C. C still nods, looking at P.
09: P: restaurant like this P brings hands back to chest. C smiles and looks at P.
10: P looks down; at "oh oh" lunges out with arm and together points to table and looks at table. P: they would go oh oh
11: C: one of these, one of these, one of these C points at each phrase above and looks at table. P laughs.
13: P: I want ice cream <point>, I want cake <point> C: yes yes <simultaneous with "cake"> C points at "cake", looks at P, then brushes hair back. P looking at table.
15: P: pizza <points> P looking at table. C: Yes yes French fries <point> C looks at table as starts to point.
16: P: one of everything P pulls hands back, looks at C. C: yes C looks at P.
17: P: and if the system just ordered stuff right then and there P looks at C, hands out and shakes, shakes again after "there." C looking at P; brushes hair. C: Right right (said after "there")
20: P: you'd be in big trouble || P looking at C and shakes hands again in same way as before. C looking at P, nods at ||.
23: C: But your kids would be ecstatic C looking at P. P looks at C, puts hands in lap.

Figure 3.5. A Playtime Example.
This aspect of the interaction illustrates the private assessment and planning which individual participants undertake in engagement. Essentially, it addresses the private question: what will keep us engaged? With the question directed to the visitor, the host also intends to re-engage the visitor in the interaction, which is minimally successful. The visitor responds but uses the response to indicate that the interaction is drawing to a close. The closing ritual presented in Figure 3.4 is, in fact, abrupt, given the overall interaction that has preceded it, because the visitor does not follow the American cultural convention of expressing appreciation or offering a simple thanks for the activities performed by the host.
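The cues catalogued in this section (conversational tracking of the partner's face, timely uptake of turns, backchannels, following pointing and movement, and participation in social activities) lend themselves to a simple running estimate of whether a partner remains engaged. The sketch below is an illustrative example of ours, not part of the Mel system described in this chapter; the cue names, weights and threshold are hypothetical and would have to be estimated from annotated host-visitor data.

```python
from dataclasses import dataclass

@dataclass
class EngagementCues:
    """One observation window of (hypothetical) perceptual cues."""
    tracking_partner_face: bool   # conversational tracking observed
    turn_delay_seconds: float     # wait before an expected turn was taken
    backchannel_given: bool       # "mm-hm", nod, etc. where one was expected
    followed_pointing: bool       # partner looked at / moved to the indicated object
    social_activity: bool         # joke, supportive rejoinder, role play, laugh

def engagement_score(cues: EngagementCues) -> float:
    """Return a score in [0, 1]; higher means stronger evidence of engagement.

    The weights are illustrative only; in practice they would have to be
    estimated from annotated data such as the hosting sessions described here.
    """
    score = 0.0
    score += 0.30 if cues.tracking_partner_face else 0.0
    score += 0.25 if cues.turn_delay_seconds < 1.5 else 0.0  # prompt uptake of the turn
    score += 0.15 if cues.backchannel_given else 0.0
    score += 0.20 if cues.followed_pointing else 0.0
    score += 0.10 if cues.social_activity else 0.0
    return score

def intends_to_disengage(cues: EngagementCues, threshold: float = 0.4) -> bool:
    """Flag likely disengagement when too few cues are present."""
    return engagement_score(cues) < threshold
```

A real host agent would, of course, accumulate such evidence over time and would need to distinguish task-driven look-aways (facing the sink, leaning under the hood) from genuine loss of attention.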
7. Computational Modelling of Human-Human Hosting and Engagement
A more solid basis for the study of human hosting is needed. To that end, ongoing analysis of several videotaped interactions between human hosts and visitors in a natural hosting situation provides details about the use of looking in engagement in hosting [Sidner et al., 2003]. Human-to-human hosting is common enough that many people have participated in such an activity in museums, outdoor tours, or retail settings. Gathering data by videotaping is somewhat intrusive on the typical hosting encounter, but not so intrusive that people are aware of the taping at all times, so their behaviour is a reliable indicator of typical hosting interactions. Hosting involves more than just engagement, because the host and the visitor have to perform a joint task which is related to the hosting activity. Therefore, engagement has to be investigated in a task-oriented setting where collaboration on a domain task is ongoing.

In each videotaped session, the host was a lab researcher, while the visitor was a guest invited by the first author to visit and see the work going on in the lab. The host demonstrated new technology in a research lab to the visitor for between 28 and 50 minutes, with the variation determined by the host and the equipment available. Careful study of this data has provided new insights about the use of one's looks that have led to the principle of conversational tracking.

Engagement is a collaboration that often happens together with collaboration on a domain task. In effect, at every moment in the hosting interactions, there are two collaborations happening: one for the participants to accomplish hosting (for example, to tour a lab, which is the domain task), and the other for the participants to stay engaged with each other. While the first collaboration provides evidence for the ongoing process of the second, it is not enough in and of itself. Engagement, as has been shown in this chapter, depends on many gestural actions as well as conversational comments. Furthermore, the initiation of engagement generally takes place before the domain task is explored,
and engagement happens even when there are no domain tasks being undertaken. Filling out this story is one of our ongoing research tasks. Collaboration on engagement transpires before, during and after collaboration on a domain task.

So what theoretical models will help explain this multi-collaboration process? One might want to argue that more complex machinery is needed than that so far suggested in conversational models of collaboration, cf. [Grosz and Sidner, 1990; Grosz and Kraus, 1996; Lochbaum, 1998]. However, we believe it is possible to account for engagement within this framework and to use the framework to develop a working computer agent, in particular a robot.

In the conversational models of collaboration, collaboration on a domain task or tasks proceeds from group activity and is accompanied by conversation. The conversation reflects the tasks being undertaken through the structure of the segments of the conversation and the intentions conveyed by participants in those segments. Tasks, or goals as the theory dubs them, are modelled by a set of recipes that specify how actions are performed in the domain to achieve the goal. Actions in a recipe can be performed by either participant, and the participants are presumed to mutually believe, or come to mutually believe, the recipes of the collaboration. Participants also come to hold individual intentions to perform actions in the recipe. The theory assumes that each participant uses the actions and recipes (1) to recognize how actions by the other participant contribute to the goal and (2) to plan his or her own acts.

Conversational collaboration theory does not specifically consider the nature of collaboration for engagement as part of conversation, but the theory and model are specified in a generic way that should also apply to engagement as a collaboration. To apply this theory to engagement, our challenge is to specify the set of rules and recipes that participants in hosting believe will achieve the goals of starting, maintaining and ending engagement. Furthermore, to express this theory computationally, a computational participant (such as a robot) must be able to recognize actions that use those recipes. Clearly, recipes for the opening and closing of a conversation as a means of starting and ending engagement can be expressed in terms of actions on the part of the participants in a domain task collaboration. What remains to be discovered are the sets of actions and action groups that form the process of maintaining engagement during a domain task collaboration. In particular, turns in the conversation about the domain task, as well as certain gaze, body stance and pointing gestures, form the class of engagement actions. The exact composition of that class is as yet unclear. What is clear is that the robot has a two-part task: to engage with the visitor and to track engagement behaviours from the visitor.

Finally, social relational behaviours play a part in both the domain collaboration and the engagement process. How does one account for the social relational behaviours discussed above in collaboration theory? While social
relational behaviours also tell participants that their counterparts are engaged, they are enacted in the context of the domain task collaboration, and hence must function with the mechanisms for that purpose. Intermixing relational connection, a social goal, and domain collaboration is feasible in collaboration theory models. In particular, the goal of making a relational connection can be accomplished via actions that contribute to the goal of the domain task collaboration. However, each collaborator must ascertain, through presumably complex reasoning, that the actions (and associated recipes) will serve their social goals as well as contribute to the domain goals. Hence they must choose actions that contribute to social goals as well as domain goals. Then they must also ascertain that the social goals are compatible with the ongoing engagement collaboration. Furthermore, they must undertake these goals jointly.

The remarkable aspect of the preceding playtime example is that the participants do not explicitly agree to demonstrate how kids will act in the restaurant. Rather, the host, who has previously demonstrated other aspects of eating in the electronic restaurant, relates the problem of children in a restaurant and begins to demonstrate the matter, when the visitor jumps in and participates jointly. The host accepts this participation by simply continuing his part in it. It appears that they are jointly participating in the hosting goal, but at the same time they are also participating jointly in a social interaction. The details that describe and explain how hosting agents and visitors accomplish this second collaboration are an important goal of ongoing research.

Not all social behaviours can be interpreted in the context of the domain task. Sometimes participants interrupt their collaborations to tell a story that is either not pertinent to the collaboration or, while pertinent, is somehow out of order. These stories are interruptions of the current domain task collaboration and are understood as having some conversational purpose. As interruptions, they signal that engagement is proceeding as expected, as long as the conversational details of the interruption operate to signal engagement. It is not interruptions in general that signal disengagement or a desire to move to disengage; it is failure to take up the interruption that signals the possibility of disengagement.
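To make the recipe machinery concrete, the fragment below sketches one way goals, recipes and actions might be encoded so that an agent can both recognize a partner's contributions and plan its own. This is a minimal sketch of ours, not the CollagenTM implementation used for Mel; the class names and the sample engagement recipe are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    name: str
    performer: str          # "host", "visitor", or "either"

@dataclass
class Recipe:
    """A recipe specifies how actions achieve a goal; the participants are
    assumed to (come to) mutually believe it."""
    goal: str
    steps: List[Action] = field(default_factory=list)

# Hypothetical recipe for starting engagement, modelled loosely on the
# opening rituals mentioned above.
START_ENGAGEMENT = Recipe(
    goal="start engagement",
    steps=[
        Action("make mutual eye contact", "either"),
        Action("greet verbally", "either"),
        Action("return greeting", "either"),
        Action("establish first topic or offer a tour", "host"),
    ],
)

def contributes_to(observed: Action, recipe: Recipe) -> bool:
    """Recognition step: does an observed action advance the goal?
    A real agent would also track which steps are already done and
    who is committed to the remaining ones."""
    return any(observed.name == step.name for step in recipe.steps)
```

The same structures would be reused for the domain task recipes, so that any observed action can be checked against both collaborations at once.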
7.1 Open Questions
The discussion above raises a number of questions that must be addressed in ongoing work. First, in the video data, the host and visitor often look away from each other at non-turn-taking times, especially when they are displaying or using demo objects. They also look up or towards the other's face in the midst of demo activities. The conversational collaboration model does not account for the kind of fine detail required to explain gaze changes, nor do the standard models of turn taking. How are we to account for these gaze changes as part of engagement? What drives collaborators to gaze away and back when
undertaking actions with objects so that they and their collaborators remain engaged?

Second, in the data, participants do not always explicitly acknowledge or accept what another participant has uttered. Sometimes they use laughs, snorts or expressions of surprise (such as "wow") to indicate that they have heard and understood, and even to confirm, what another has said. These verbal expressions are appropriate because they express appreciation of a joke, a humorous story, or the outcome of a demo. We are interested in the range and character of these phenomena, as well as how they are generated and interpreted.

Third, this chapter argues that much of engagement can be modelled using the computational collaboration theory model of Grosz and Sidner [1990], Grosz and Kraus [1996], and Lochbaum [1998]. However, a fuller computational picture is needed to explain how participants decide to signal engagement as continuing and how to recognize these signals. One such picture has been explored in [Sidner et al., 2005].
8. A Next Generation Mel
While pursuing a theory of human-human engagement, we continue to add new capabilities for Mel that are founded on human communication. To accomplish this, the next generation Mel combines hosting conversations with other research at MERL on face tracking and sound location [Sidner et al., 2005]. This combination makes it possible to locate visitors and then greet them in ways similar to human experience. These vision and sound algorithms, as well as others under development, permit Mel to make use of nodding and changes in looking, which are important indicators in conversation for turn taking as well as expressions of attention. Mel's architecture continues to evolve to the point that Mel has both a "brain," performing CollagenTM-related functions, and a "body," fusing sensory data to feed to the brain and controlling Mel's motions. Building a robot that can detect faces, track them and notice when the face disengages for a brief or extended period of time demonstrates more engagement behaviour than has been possible before.

One challenge for a robot host is to experiment with techniques for dealing with unexpected speech input. People, it is said, say the darndest things. While the CollagenTM middleware continues to be used for modelling conversation, the struggle continues to find reasonable behaviours for unexpected visitor utterances. For example, when demonstrating a device that requires filling a cup with water, a visitor may make a mistake, spill water on the floor, and say "Oops, I spilled water on the floor." To understand this utterance, the speech recognizer must correctly process the words, and the sentence semantics must give it a meaningful description, after which the conversation engine must determine the purpose of the meaning description. Finally, Mel must
respond to it. If this sort of utterance was not predicted to occur (and there will be many such utterances), the best response that Mel can currently produce is "I do not understand the purpose of your utterance. Please find a human to help me." Even that response is only possible if Mel has understood the visitor utterance up to its purposive intent. Failures at speech recognition or sentence interpretation will produce even less informative error messages. These difficulties result from the limits of current speech understanding technology and continue to limit the naturalness of interaction with Mel. Research on improving speech recognition and sentence interpretation is therefore essential to the next generation Mel to improve interaction with human subjects.
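The difficulty just described has a natural software shape: each stage of understanding can fail, and the agent should degrade as gracefully as the remaining information allows. The sketch below is an illustrative outline of that cascade, not the actual CollagenTM or Mel code; the function names, stubs and all messages apart from the quoted fallback are hypothetical.

```python
from typing import Optional

def recognize_speech(audio) -> Optional[str]:
    """Return a word string, or None if recognition fails (hypothetical stub)."""
    ...

def parse_semantics(words: str) -> Optional[dict]:
    """Return a meaning description, or None if interpretation fails (hypothetical stub)."""
    ...

def determine_purpose(meaning: dict) -> Optional[str]:
    """Map a meaning description to a purpose known to the conversation engine (hypothetical stub)."""
    ...

def plan_reply(purpose: str) -> str:
    """Generate the normal collaborative response (hypothetical stub)."""
    ...

def respond_to(audio) -> str:
    """Degrade gracefully: the later the failure, the more informative the reply."""
    words = recognize_speech(audio)
    if words is None:
        return "I'm sorry, I didn't catch that."        # hypothetical wording
    meaning = parse_semantics(words)
    if meaning is None:
        return "I'm sorry, I didn't understand that."   # hypothetical wording
    purpose = determine_purpose(meaning)
    if purpose is None:
        # Fallback quoted in the text for utterances whose purpose is unknown.
        return ("I do not understand the purpose of your utterance. "
                "Please find a human to help me.")
    return plan_reply(purpose)
```

The point of the sketch is only the ordering of the failure modes: the later the cascade fails, the more specific the error message can be.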
9. Summary
Hosting activities are natural and common among humans, and they can be accommodated by human-robot interaction. Making the human-machine experience a natural one requires understanding the nature of engagement and applying the same types of human engagement behaviour to robot participation in hosting activities, as well as other activities. Engagement is a collaborative activity that is accomplished through both linguistic and gestural means. The experiments described in this chapter with a stationary robot that can converse and point provide an initial example of an engaged conversationalist. Through the study of human-human hosting activities, new models of engagement for human-robot hosting will provide a more detailed means of interaction between humans and robots.
Acknowledgements
The authors wish to acknowledge the work of Paul Dietz in creating Mel, and of Neal Lesh, Charles Rich, and the late Jeff Rickel on Collagen and PACO.
References
Bickmore, T. (2003). Relational Agents: Effecting Change through Human-Computer Relationships. PhD thesis, Media Arts and Sciences, Massachusetts Institute of Technology.
Bickmore, T. and Cassell, J. (2001). Relational Agents: A Model and Implementation of Building User Trust. In Proceedings of the International Conference for Human-Computer Interaction (CHI), pages 396–403. New York: ACM Press.
Burgard, W., Cremers, A. B., Fox, D., Haehnel, D., Lakemeyer, G., Schulz, D., Steiner, W., and Thrun, S. (1998). The Interactive Museum Tour Guide Robot. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI), pages 11–18. Menlo Park, CA: AAAI Press.
Cassell, J. (2000). Nudge Nudge Wink Wink: Elements of Face-to-Face Conversation for Embodied Conversational Agents. In Cassell, J., Sullivan, J., Prevost, S., and Churchill, E., editors, Embodied Conversational Agents, pages 1–27. Cambridge, MA: MIT Press.
Cassell, J., Bickmore, T., Campbell, L., Vilhjálmsson, H., and Yan, H. (2000a). Human Conversation as a System Framework: Designing Embodied Conversational Agents. In Cassell, J., Sullivan, J., Prevost, S., and Churchill, E., editors, Embodied Conversational Agents, pages 29–63. Cambridge, MA: MIT Press.
Cassell, J., Nakano, Y. I., Bickmore, T. W., Sidner, C. L., and Rich, C. (2001a). Non-Verbal Cues for Discourse Structure. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), pages 106–115. Menlo Park, CA: Morgan Kaufman Publishers.
Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. (2000b). Embodied Conversational Agents. Cambridge, MA: MIT Press.
Cassell, J., Vilhjálmsson, H., and Bickmore, T. W. (2001b). BEAT: The Behavior Expression Animation Toolkit. In Proceedings of SIGGRAPH 2001, pages 477–486. New York: ACM Press.
Clark, H. H. (1996). Using Language. Cambridge: Cambridge University Press.
Cohen, P. and Levesque, H. (1990). Persistence, Intention and Commitment. In Cohen, P., Morgan, J., and Pollack, M. E., editors, Intentions in Communication, pages 33–70. Cambridge, MA: MIT Press.
Duncan, S. (1974). Some Signals and Rules for Taking Speaking Turns in Conversation. In Weitz, S., editor, Nonverbal Communication. New York: Oxford University Press.
Grosz, B. J. and Kraus, S. (1996). Collaborative Plans for Complex Group Action. Artificial Intelligence, 86(2):269–357.
Grosz, B. J. and Sidner, C. L. (1990). Plans for Discourse. In Cohen, P., Morgan, J., and Pollack, M., editors, Intentions in Communication, pages 417–444. Cambridge: MIT Press.
Johnson, W. L., Rickel, J. W., and Lester, J. C. (2000). Animated Pedagogical Agents: Face-to-Face Interaction in Interactive Learning Environments. International Journal of Artificial Intelligence in Education, 11:47–78.
Katagiri, Y., Takahashi, T., and Takeuchi, Y. (2001). Social Persuasion in Human-Agent Interaction. In Proceedings of the Second IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, pages 64–69. Menlo Park, CA: Morgan Kaufman Publishers.
Lochbaum, K. E. (1998). A Collaborative Planning Model of Intentional Structure. Computational Linguistics, 24(4):525–572.
Luger, H. H. (1983). Some Aspects of Ritual Communication. Journal of Pragmatics, 7(3):695–711.
McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. Chicago: University of Chicago Press.
Pelachaud, C., Badler, N., and Steedman, M. (1996). Generating Facial Expressions for Speech. Cognitive Science, 20(1):1–46.
Rich, C. and Sidner, C. L. (1998). COLLAGEN: A Collaboration Manager for Software Interface Agents. User Modeling and User-Adapted Interaction, 8(3/4):315–350.
Rich, C., Sidner, C. L., and Lesh, N. (2001). COLLAGEN: Applying Collaborative Discourse Theory to Human-Computer Interaction. AI Magazine, Special Issue on Intelligent User Interfaces, 22(4):15–25.
Rickel, J., Lesh, N., Rich, C., Sidner, C. L., and Gertner, A. (2002). Collaborative Discourse Theory as a Foundation for Tutorial Dialogue. In Proceedings of Intelligent Tutoring Systems. New York: ACM Press.
Schegloff, E. and Sacks, H. (1973). Opening Up Closings. Semiotica, 7(4):289–327.
Sidner, C. L. (1994a). An Artificial Discourse Language for Collaborative Negotiation. In Proceedings of the Twelfth National Conference on Artificial Intelligence, volume 1, pages 814–819. Cambridge, MA: MIT Press.
Sidner, C. L. (1994b). Negotiation in Collaborative Activity: A Discourse Analysis. Knowledge-Based Systems, 7(4):265–267.
Sidner, C. L. and Dzikovska, M. (2002). Hosting Activities: Experience with and Future Directions for a Robot Agent Host. In Proceedings of the 2002 Conference on Intelligent User Interfaces, pages 143–150. New York: ACM Press.
Sidner, C. L., Kidd, C., Lee, C., Lesh, N., and Rich, C. (2005). Explorations in Engagement for Humans and Robots. Artificial Intelligence. Forthcoming.
Sidner, C. L., Lee, C., and Lesh, N. (2003). Engagement by Looking: Behaviors for Robots when Collaborating with People. In Kruijff-Korbayova, I. and Kosny, C., editors, DiaBruck: The Proceedings of the Seventh Workshop on the Semantics and Pragmatics of Dialogue, pages 123–130, University of Saarland, Germany.
PART II
ANNOTATION AND ANALYSIS OF MULTIMODAL DATA: SPEECH AND GESTURE
Chapter 4
FORM
An Extensible, Kinematically-Based Gesture Annotation Scheme

Craig H. Martell
Department of Computer Science and The MOVES Institute∗
Naval Postgraduate School, Monterey, CA, USA
[email protected]
Abstract
Annotated corpora have played a critical role in speech and natural language research; and, there is an increasing interest in corpora-based research in sign language and gesture as well. We present a non-semantic, geometrically-based annotation scheme, FORM, which allows an annotator to capture the kinematic information in a gesture just from videos of speakers. In addition, FORM stores this gestural information in Annotation Graph format—allowing for easy integration of gesture information with other types of communication information, e.g., discourse structure, parts of speech, intonation information, etc.1
Keywords: Gesture, annotation, corpora, corpus-based methods, multimodal communication.
1. Introduction
FORM2 is an annotation scheme designed both to describe the kinematic information in a gesture and to be extensible in order to add speech and other conversational information.
∗ Much of this work was done at the University of Pennsylvania and at The RAND Corporation as well.
1 This presentation is a modified version of [Martell, 2002].
2 The author wishes to sincerely thank Adam Kendon for his input on the FORM project. He not only provided suggestions as to the direction of the project; his unpublished work on a kinematically-based gesture annotation scheme was also the FORM project's starting point [Kendon, 2000].
Our goal is to build an extensible corpus of annotated videos in order to allow for general research on the relationship among the many different aspects of conversational interaction. Additionally, further tools and algorithms to add annotations and to evaluate inter-annotator agreement will be developed. The end result of this work will be a corpus of annotated conversational interaction, which can be:

extended to include new types of information concerning the same conversations; as new tag-sets and coding schemes are developed—discourse-structure or facial-expression, for example—new annotations could easily be added;

used to test scientific hypotheses concerning the relationship of the paralinguistic aspects of communication to speech and to meaning;

used to develop statistical algorithms to automatically analyze and generate these paralinguistic aspects of communication (e.g., for Human-Computer Interface research).
2. Structure of FORM
FORM3 is designed as a series of tracks representing different aspects of the gestural space. Generally, each independently moved part of the body has two tracks, one track for Location/Shape/Orientation, and one for Movement. When a part of the body is held without movement, a Location object describes its position and spans the amount of time the position is held. When a part of the body is in motion, Location objects of zero duration are placed at the beginning and end of the movement. Location objects of zero duration are also used to indicate the Location information at critical points in certain complex gestures. An object in a movement track spans the time period for which the body part in question is in motion.

It is often the case that one part of the body will remain static while others move. For example, a single hand shape may be held throughout a gesture in which the upper arm moves. FORM's multi-track system allows such disparate parts of single gestures to be recorded separately and efficiently and to be viewed easily once recorded. Once all tracks are filled with the appropriate information, it is easy to see the structure of a gesture broken down into its anatomical components.

At the highest level of FORM are groups. Groups can contain subgroups. Within each group or subgroup are tracks. Each track contains a list of attributes concerning a particular part of the arm or body. At the lowest level (under each attribute), all possible values are listed.

3 The author wishes to acknowledge Jesse Friedman and Paul Howard in this section. Most of what is written here is from their Code Book section of http://www.ldc.upenn.edu/Projects/FORM/.
Described below are the tracks for the Location of the Right or Left Upper Arm.

Right/Left Arm: Upper Arm (from the shoulder to the elbow).

Location

Upper arm lift (from side of the body). Values: no lift; 0-45; approx. 45; 45-90; approx. 90; 90-135; approx. 135; 135-180; approx. 180.

Relative elbow position: The upper arm lift attribute defines a circle on which the elbow can lie. The relative elbow position attribute indicates where on that circle the elbow lies. Combined, these two attributes provide full information about the location of the elbow and reveal total location information (in relation to the shoulder) of the upper arm. Values: extremely inward; inward; front; front-outward; outward (in frontal plane); behind; far behind.

Figure 4.1 - Figure 4.4 are example stills with the appropriate values of the above two attributes given. The next three attributes individually indicate the direction in which the biceps muscle is pointed in one spatial dimension. Taken together, these three attributes give the orientation of the upper arm.
Figure 4.1. Upper arm lift: approx. 90; Relative elbow position: outward.
Figure 4.2. Upper arm lift: approx. 45; Relative elbow position: front.
Figure 4.3. Upper arm lift: 0-45; Relative elbow position: behind.
Biceps: Inward/Outward. Values: none; inward (see Figure 4.5); outward (see Figure 4.6).
Figure 4.4. Upper arm lift: no lift; Relative elbow position: outward.
Biceps: Upward/Downward. Values: none; upward (see Figure 4.7); downward (see Figure 4.8).
Biceps: Forward/Backward. Values: none; forward (see Figure 4.9); backward (see Figure 4.10).
Figure 4.5. inward.
Obscured: This is a binary attribute which allows the annotator to indicate whether the attributes and values chosen were "guesses" necessitated by visual occlusion. This attribute is present in each of FORM's tracks. Again, we have only presented the Location tracks for the Right or Left Arm Upper Arm group.
Figure 4.6. outward.
Figure 4.7. upward.
Figure 4.8. downward.
The full "Code Book" can be found at http://www.ldc.upenn.edu/Projects/FORM/. Listed there are all the Group, Subgroup, Track, Attribute and Value possibilities.
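To make the track structure above concrete, the sketch below shows one possible in-memory representation of FORM annotations: tracks gathered under body-part groups, attribute-value pairs, Location objects that may have zero duration, and Movement objects that span an interval. It is a hypothetical illustration of ours, not the actual FORM tools; the class and field names are invented, and the sample values are taken from the upper-arm attributes listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TrackObject:
    """A Location or Movement object on a single track.

    A held position is a Location object whose start < end; a Location
    object with start == end (zero duration) marks the position at the
    beginning or end of a movement, or at a critical point within it.
    """
    start: float                      # seconds into the video
    end: float
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class Track:
    name: str                         # e.g. "Location" or "Movement"
    objects: List[TrackObject] = field(default_factory=list)

@dataclass
class Group:
    name: str                         # hypothetical naming convention
    tracks: List[Track] = field(default_factory=list)

# A fragment of an annotation for an upper-arm lift, using attribute
# names from the codebook excerpt above.
upper_arm = Group(
    name="Right Arm / Upper Arm",
    tracks=[
        Track("Location", [
            TrackObject(73.34, 73.34, {"Upper arm lift": "no lift",
                                       "Relative elbow position": "front"}),
            TrackObject(73.57, 73.57, {"Upper arm lift": "approx. 45",
                                       "Relative elbow position": "front"}),
        ]),
        Track("Movement", [
            TrackObject(73.34, 73.57, {"Obscured": "no"}),
        ]),
    ],
)
```

Note how the zero-duration Location objects bracket the Movement object, which is the convention described at the start of this section.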
Figure 4.9. forward.
Figure 4.10. backward.

3. Annotation Graphs
In order to allow for maximum extensibility, FORM uses annotation graphs (AGs) as its logical representation4. As described in [Bird and Liberman, 1999], annotation graphs are a formal framework for "representing linguistic annotations of time series data." AGs do this by abstracting away from the physical-storage layer, as well as from application-specific formatting, to provide a "logical layer for annotation systems." An annotation graph is a collection of arcs and nodes which share a common time line, that of a video tape, for example. Each node represents a time stamp and each arc represents some linguistic event spanning the time between the nodes. In FORM, the arcs are labelled with both attributes and values, so that the arc given by the 4-tuple (1,5,Wrist Movement,Side-to-side) represents that there was side-to-side wrist movement between time stamp 1 and time stamp 5.
4 Cf. [Martell, 2002] for a more complete discussion of FORM's use of AGs.
The advantage of using annotation graphs as the logical representation is that it is easy to combine heterogeneous data—as long as they share a common time line. So, if we have a dataset consisting of gesture arcs, as above, we can easily extend it to represent discourse structure, for example, simply by adding other arcs which have discourse-structure attributes and values. Again, this allows different researchers to use the same linguistic data for many different purposes, while, at the same time, allowing others to explore the correlations between the different phenomena being studied.
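Concretely, an annotation graph of this kind can be held as nothing more than a set of labelled arcs over shared time stamps. The sketch below is our own minimal illustration rather than the actual annotation-graph toolkit used by FORM; it shows the (start, end, attribute, value) arc format from the text and how gesture and discourse-structure arcs can coexist on one timeline. The hand-shape and discourse values are invented.

```python
from typing import List, NamedTuple

class Arc(NamedTuple):
    start: float        # time stamp of the source node
    end: float          # time stamp of the target node
    attribute: str
    value: str

# Gesture arcs, in the 4-tuple form used in the text.
graph: List[Arc] = [
    Arc(1, 5, "Wrist Movement", "Side-to-side"),
    Arc(1, 5, "HandShape", "open flat"),                               # hypothetical value
]

# Extending the same graph with a different kind of annotation is just
# adding more arcs that share the common time line.
graph.append(Arc(0, 7, "Discourse Segment", "topic introduction"))     # hypothetical value

def arcs_overlapping(graph: List[Arc], t0: float, t1: float) -> List[Arc]:
    """All annotations (of any type) that overlap the interval [t0, t1]."""
    return [a for a in graph if a.start < t1 and a.end > t0]

print(arcs_overlapping(graph, 2, 3))
```

Because every arc carries its own attribute name, gesture, discourse and intonation annotations can live in the same graph without interfering with one another.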
4. Annotation Example
To gain a better understanding of the process of FORM annotation, we present here a small visual example. The four stills of Figure 4.11 are from a video sequence of Brian MacWhinney teaching a research methods course at Carnegie Mellon University5 . We show these four key frames, here, for illustrative purposes only. The character of the gesture is gleaned from viewing the continuous movement in the video. However, these key frames would be used to set the time stamp and locations of the beginning and end of the movement and in-between points that are important to capturing its shape. The arcs in the annotation graph described below (Figure 4.13) capture the information for the movement in between the key frames.
Figure 4.11. Snapshots of Brian MacWhinney on January 24, 2000.
The FORM annotation, then, of the video, from time stamp 1:13.34 (1 minute 13.34 seconds) to time stamp 1:14.01, is shown in Figure 4.12. This is the view on the data that a particular tool, Anvil [Kipp, 2001], presents to the annotator6.

5 These data were chosen because they are part of the TalkBank collection (http://www.talkbank.org). TalkBank was responsible for funding a large part of this project.
6 Anvil is described in further detail in the appendix.
Figure 4.12. FORM annotation of Jan24.mov, using Anvil as the annotation tool.
As described above, FORM uses annotation graphs as its logical representation of the data; so regardless of the choice of annotation tool, FORM's internal view is the annotation graph given in Figure 4.13.
Figure 4.13. FORM/Annotation Graph representation of example gesture.
Again, FORM uses vectors of attribute:value pairs to capture the gestural information of each section of the arms and hands. In Figure 4.13, then, the arc labelled HandandWrist.Movement from 1:13.34 to 1:13.57 encodes the kinematics of Brian’s moving his right hand or wrist during this time period, and the arc from 1:13.24 to 1:13.67 encodes a change in his right hand’s shape.7
5. Preliminary Inter-Annotator Agreement Results
Preliminary results from FORM show that, with sufficient training, agreement among the annotators can be very high. Table 4.1 shows preliminary inter-annotator agreement results from a FORM pilot study.8 The results are for two trained annotators for approximately 1.5 minutes of Jan24-09.mov, the video from Figure 4.11. For this clip, the two annotators agreed that there were at least these 4 gesture excursions. One annotator found 2 additional excursions.

Precision refers to the decimal precision of the time stamps given for the beginning and end of gestural components. The SAME value means that all time-stamps were given the same value. This was done in order to judge agreement with the need to judge the exact beginning and end of an excursion factored out. Exact vs. No-Value percentage refers to whether both the attributes and values matched exactly or whether just the attributes matched exactly. This distinction is included because a gesture excursion is defined as all movement between two rest positions of the arms and hands. For an excursion, the annotators have to judge both which parts of the arms and hands are salient to the movement (e.g., upper-arm lift and rotation, forearm change in orientation, and hand/wrist position) as well as what values to assign (e.g., the upper arm lifted 15 degrees and rotated 45 degrees). So, the No-Value% column captures the degree to which the annotators agree just on the structure of the movement, while Exact% measures agreement on both structure and values.

The degree to which inter-annotator agreement varies among these gestures might suggest difficulty in reaching consensus. However, the results of intra-annotator agreement studies demonstrate that a single annotator shows similar variance when doing the same video clip at different times. Table 4.2 gives the intra-annotator results for one annotator annotating the first 2 gesture excursions of Jan24-09.mov. For both sets of data, the pattern is the same: the less precise the time-stamps, the better the results; No-Value% is significantly higher than Exact%.

7 For the example given in Figure 4.11, Brian is only moving his right hand. Accordingly, the "Right." prefix which normally would have been attached to the arc labels has been left off.
8 Essentially, all the arcs for each annotator are thrown into a bag. Then all the bags are combined and the intersection is extracted. This intersection constitutes the overlap in annotation, i.e., where the annotators agreed. The percentage of the intersection to the whole is then calculated to get the scores presented.
Table 4.1. Inter-Annotator Agreement on Jan24-09.mov.

Gesture Excursion   Precision   Exact%   No-Value%
1                   2            3.41      4.35
                    1           10.07     12.8
                    0           29.44     41.38
                    SAME        56.92     86.15
2                   2           37.5      52.5
                    1           60        77.5
                    0           75.56     94.81
                    SAME        73.24     95.77
3                   2            0         0
                    1           19.25     27.81
                    0           62.5      86.11
                    SAME        67.61     95.77
4                   2           10.2      12.06
                    1           25.68     31.72
                    0           57.77     77.67
                    SAME        68.29     95.12

Table 4.2. Intra-Annotator Agreement on Jan24-09.mov.

Gesture Excursion   Precision   Exact%   No-Value%
1                   2            5.98      7.56
                    1           20.52     25.21
                    0           58.03     74.64
                    SAME        85.52     96.55
2                   2            0         0
                    1           25.81     28.39
                    0           89.06     95.31
                    SAME        90.91     93.94
It is also important to note that Gesture Excursion 1 is far more complex than Gesture Excursion 2. And, in both simple and complex gestures, inter-annotator agreement is approaching intra-annotator agreement. Notice, also, that for Excursion 2, inter-annotator agreement is actually better than intra-annotator agreement for the first two rows. This is a result of the difficulty, even for the same person over time, of precisely pinning down the beginning and end of a gesture excursion. Although the preliminary results are very encouraging,
all of the above suggests that further research concerning training, and concerning how to judge the similarity of gestures, is necessary. Visual information may need very different similarity criteria. Also, it is not clear as of the time of writing how these results might generalize. In particular, the relationship between inter- and intra-annotator agreement needs to be further explored. In addition, comparison studies with other methods of judging agreement are necessary. For example, how does FORM's method compare with Cronbach's alpha evaluated at discrete time-slices9? And what would be the result of adding a kappa-score analysis to the bag-of-arcs technique?
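The bag-of-arcs scoring described in footnote 8 is straightforward to express in code. The sketch below is our own reading of that description, not the script used for the pilot study: the rounding step stands in for the Precision settings in the tables, the SAME setting is treated as collapsing all time stamps, and a Dice-style ratio is used because the exact normalisation ("the percentage of the intersection to the whole") is not fully specified.

```python
from collections import Counter
from typing import List, Tuple

Arc = Tuple[float, float, str, str]   # (start, end, attribute, value)

def normalize(arc: Arc, precision: int, drop_values: bool, same: bool = False) -> tuple:
    start, end, attribute, value = arc
    if same:
        start, end = 0.0, 0.0          # assumption: SAME collapses all time stamps
    else:
        start, end = round(start, precision), round(end, precision)
    return (start, end, attribute) if drop_values else (start, end, attribute, value)

def agreement(a: List[Arc], b: List[Arc], precision: int = 2,
              drop_values: bool = False, same: bool = False) -> float:
    """Percentage of arcs shared by the two annotators.

    drop_values=False corresponds to Exact%, drop_values=True to No-Value%.
    A Dice-style ratio (2 * intersection / total arcs) is used here.
    """
    bag_a = Counter(normalize(x, precision, drop_values, same) for x in a)
    bag_b = Counter(normalize(x, precision, drop_values, same) for x in b)
    shared = sum((bag_a & bag_b).values())     # multiset intersection
    total = sum((bag_a + bag_b).values())      # the combined bag
    return 100.0 * 2 * shared / total if total else 0.0
```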
6. Conclusion: Applications to HLT and HCI?
We are augmenting FORM to include richer paralinguistic information (Head/Torso Movement, Transcription/Syntactic Information, and Intonation/Pitch Information). This will create a corpus that allows for research that heretofore we have been unable to do. It will facilitate experiments that we predict will be useful for speech recognition and other Human-Language Technologies (HLT). As an example of similar research, consider the work of Francis Quek et al. [2001]. They have been able to demonstrate that gestural information is useful in helping with the automatic detection of discourse transitions. However, their results are limited by the amount of kinematic information they can gather with their video-capture system. Further, we believe an augmented FORM corpus will contain much more specific data and will allow for more fine-grained analyses than are currently feasible.

Additionally, knowing the relationships among the different facets of human conversation will allow for more informed research in Human-Computer Interaction (HCI). If one of the goals of HCI is to have better immersive training, then it will be imperative that we understand the subtle connections among the paralinguistic aspects of interaction. A virtual human, for example, would be much better if it were able to understand, and act in accordance with, all of our communicative modalities. Having an extensible corpus such as we describe in this chapter is a first step that will allow many researchers, across many disciplines, to explore these and other useful ideas.
9 I am indebted to an anonymous reviewer for this suggestion.
Appendix: Other Tools, Schemes and Methods of Gesture Analysis

FORM has been designed to be simultaneously useful both for (1) capturing the kinematics of gesture and for (2) developing a corpus of annotated videos useful for computational analysis and synthesis. There is prior research along each of these dimensions that has contributed to, or has motivated, FORM. In this section we briefly review this prior work.
A.1 Non-computational Gesture Analysis
Two important figures in the linguistic and psychological (read: non-computational) analysis of gesture are David McNeill and Adam Kendon. Each has developed annotation schemes and systems to analyze and annotate the gestures of speakers in video. However, their respective levels of analysis are quite different.
A.1.1 David McNeill: Hand and Mind. David McNeill [1992] uses a scheme which divides the gesture space into four basic types: Beat gestures; Iconic gestures; Metaphoric gestures; and Deictic gestures. These categories are not meant to be mutually exclusive, although McNeill [1992] has been blamed for making it appear so. According to the McNeill Lab web site:

A misconception has arisen about the nature of the gesture categories described in Hand and Mind, to wit, that they are mutually exclusive bins into which gestures should be dumped. In fact, pretty much any gesture is going to involve more than one category. Take a classic upward path gesture of the sort that many subjects produce when they describe the event of the cat climbing up the pipe in our cartoon stimulus. This gesture involves an iconic path-for-path mapping, but is also deictic. . . . Even "simple" beats are often made in a particular location which the speaker has given further structure (e.g. by setting up an entity there and repeatedly referring to it in that spatial location). Metaphoric gestures are de facto iconic gestures. . . . The notion of a type, therefore, should be considered as a continuum–with a given gesture having more or less iconicity, metaphoricity, etc.10

10 http://mcneilllab.uchicago.edu/topics/type.html, as of 12/15/2003.
This work has been very influential and has been the basis for at least one major computational project (see the BEAT toolkit, below). However, this level of analysis only serves to categorize the gesture. It provides no useful computational information either for automatic gesture analysis or for the automatic generation of gestures in computational agents.
A.1.2 Adam Kendon: The Kinetics of Gesture. Adam Kendon's approach, best articulated in "An Agenda for Gesture Studies" [Kendon, 1996], is to annotate and analyze at a more fine-grained level. His goal is to develop a "kinetics" of gesture, analogous to the "phonetics" of speech. As such, he develops in [Kendon, 2000] a scheme which captures how joints are bent, how the different aspects of the arm move, and even how these different dimensions of gesture align with speech. This system describes positions and changes in position of the speaker's arms, hands, head and torso. Unfortunately, from our perspective, the annotation scheme was designed to be written on paper or to be used with a word processor. As such, there is no sufficient way to do fine-grained time alignment of gesture to speech. FORM's original motivation was to computerize this scheme so that fine-grained time alignment was possible. Kendon's work is the fundamental starting point for FORM.
A.2 Computer-based Annotation Tools and Systems
A.2.1 CHILDES/CLAN. The CHILDES/CLAN system [MacWhinney, 1996] is a suite of tools for studying conversational interactions in general. The suite allows for, among other things, the coding and analyzing of transcripts and for linking those transcripts to digitized audio and video. CLAN supports both CHAT and CA (Conversational Analysis) notation, with the alignment of text to the digitized media at the phrase level. The CHILDES/CLAN system has the major advantage of being one of the first of its kind. The CHILDES database of transcripts of parent-child interactions has dramatically pushed forward both the theory and science of linguistics and language-acquisition. Additionally, it appears possible—in the future—to integrate FORM data with that developed by CHILDES/CLAN into a unified data set. This is due to the open-ended nature and extensibility of both systems. However, from the perspective of actually annotating videos with fine-grained, time-aligned gesture data, CLAN presents a problem. It is possible to describe the gesture that occurred during an utterance, but, given that time alignment is only at the phrasal level, we are unable to finely associate the parts of the gesture with other aspects of conversational interaction. A.2.2 SignStream. SignStream [Neidle et al., 2001] allows users to annotate video and audio language data in multiple parallel fields that display the temporal alignment and relations among events. It has been used most
extensively for analysis of signed languages. It allows for annotation of manual and non-manual (head, face, body) information; type of message (e.g. Wh-question); parts of speech; and spoken-language translations of sentences. Although SignStream would work with the FORM annotation scheme, and there has been some attempt at integrating the two projects, its interface is too comprehensive. Anvil, described below, allows an annotator to quickly see the relationship among all the aspects of left arm, right arm, head and torso movement.
A.2.3 Anvil. Anvil [Kipp, 2001] is a Java-based tool which permits multi-layered annotation of video with gesture, posture, and discourse information. The tags used can be freely specified, and can easily be hierarchically arranged. See Figure 4.12 as an example. Anvil is the tool of choice for the work done in the FORM Lab. It works well for creating multi-tiered, hierarchical, time-aligned annotations. In the beginning of FORM, we toyed with the idea of building FORMTool, our own annotation tool. However, we soon realized that we were just duplicating the benefits of Anvil. Additionally, the extensible nature of Anvil will allow for the development of an Annotation Graph plug-in, so our data can be directly exported to AG format. Currently, we save the data in Anvil XML format and convert to AG format. This future plug-in will avoid this step.
A.3 Systems for Computational Analysis and Generation
A.3.1 VISLab: Francis Quek. The VISLab project (http://vislab.cs.wright.edu/) is a large-scale, low-level-of-analysis research project developed and led by Francis Quek at Wright State University. It has achieved significant results in understanding the relationship of speech to gesture; see [Quek et al., 2001], for example. The long-term intent of the project is to create a large-scale dataset of videos annotated with information about gesture, speech and gaze. This project is in the same spirit as the FORM project, and there has been significant collaboration. There are plans for the VISLab to store the gesture aspects of their data in the FORM format.

There are, however, major differences between the two projects. Firstly, FORM aims at developing a mid-level representation that humans can use to annotate gestures and that machines can use to analyze and generate gestures. The VISLab system's level of representation is much lower. They use multiple cameras to extract 3D information about position, velocity, acceleration, etc. concerning a gesture. They are doing the physics of gesture, where FORM is looking at something closer to the phonetics of gesture. Secondly, the VISLab system requires a complex setup
of multiple, precisely-positioned cameras and proper placement of the subjects in order to gather their data. FORM allows any researcher with a notebook PC and a video camera to generate useful data. Thus, FORM can be used in the “field,” where the VISLab system requires a laboratory setting.
A.3.2 BEAT Toolkit: Justine Cassell et al. The other important, large-scale project is the Behaviour Expression Animation Toolkit (BEAT) [Cassell et al., 2001]. It was developed at the MIT Media Lab in the Gesture and Narrative Language Research Group. This work is advanced and is, by far, the most influential to date. It allows for the easy generation of synchronized speech and gesture in computer-animated characters. The animator simply types in the sentence that he/she wishes the character to say, and the BEAT Toolkit generates marked-up text which can serve as the input to any of a number of animation systems. The system is extensible to many different communicative behaviours and domains. The output generated for a given input string is domain specific, and the training data for that domain must be provided. The main purpose of BEAT is to appropriately schedule gestures (and other non-verbal behaviours) so they are synchronized with the speech.

The BEAT toolkit is the first of a new generation (the beat generation) of animation tool that extracts actual linguistic and contextual information from text in order to suggest correlated gestures, eye gaze, and other nonverbal behaviours, and to synchronize those behaviours to one another. For those animators who wish to maintain the most control over output, BEAT can be seen as a kind of "snap-to-grid" for communicative actions: if animators input text, and a set of eye, face, head and hand behaviours for phrases, the system will correctly align the behaviours to one another, and send the timings to an animation system. For animators who wish to concentrate on higher level concerns such as personality, or lower level concerns such as motion characteristics, BEAT takes care of the middle level of animation: choosing how nonverbal behaviours can best convey the message of typed text, and scheduling them.12
FORM’s relationship to BEAT is more one of potential partners than as competitors. BEAT is most concerned with the automatic generations of the timings, and the higher and lower levels, as aforementioned, are left to the animator. In particular, the lower level of specifying motion characteristics is where FORM is most concerned. We see FORM as potentially a more robust way to specify the gestures for which BEAT schedules the timings. The typology of gestures that BEAT uses is based on the work of McNeill13 . As such, it sees gestures through the eyes of his ontology. It is, then, left up to the animator to specify exactly how a beat or a deictic, for example, is to be animated. 12 [Cassell 13 Cf.
et al., 2001, page 8]. discussion of McNeill, above.
We believe the data generated by the FORM annotation system will allow for a more robust output from BEAT, which would further alleviate the work of animators.
References
Bird, S. and Liberman, M. (1999). A Formal Framework for Linguistic Annotation. Technical Report MS-CIS-99-01, Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia, Pennsylvania. http://citeseer.nj.nec.com/article/bird99formal.html.
Cassell, J., Vilhjálmsson, H. H., and Bickmore, T. (2001). BEAT: The Behavior Expression Animation Toolkit. In Fiume, E., editor, Proceedings of SIGGRAPH, pages 477–486. ACM Press / ACM SIGGRAPH. http://citeseer.ist.psu.edu/cassell01beat.html.
Kendon, A. (1996). An Agenda for Gesture Studies. Semiotic Review of Books, 7(3):8–12.
Kendon, A. (2000). Suggestions for a Descriptive Notation for Manual Gestures. Unpublished.
Kipp, M. (2001). Anvil - A Generic Annotation Tool for Multimodal Dialogue. In Proceedings of Eurospeech 2001, pages 1367–1370, Aalborg, Denmark.
MacWhinney, B. (1996). The CHILDES System. American Journal of Speech-Language Pathology, 5:5–14.
Martell, C. (2002). FORM: An Extensible, Kinematically-based Gesture Annotation Scheme. In Proceedings of the International Language Resources and Evaluation Conference (LREC), pages 183–187. European Language Resources Association. http://www.ldc.upenn.edu/Projects/FORM.
McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago, USA.
Neidle, C., Sclaroff, S., and Athitsos, V. (2001). SignStream: A Tool for Linguistic and Computer Vision Research on Visual-Gestural Language Data. Behavior Research Methods, Instruments, and Computers, 33(3):311–320. Psychonomic Society Publications. http://www.bu.edu/asllrp/SignStream/.
Quek, F., Bryll, R., McNeill, D., and Harper, M. (2001). Gestural Origo and Loci-Transitions in Natural Discourse Segmentation. Technical Report VISLab-01-12, Department of Computer Science and Engineering, Wright State University. http://vislab.cs.vt.edu/Publications/2001/QueBMH01.html.
Chapter 5
ON THE RELATIONSHIPS AMONG SPEECH, GESTURES, AND OBJECT MANIPULATION IN VIRTUAL ENVIRONMENTS: INITIAL EVIDENCE

Andrea Corradini and Philip R. Cohen
Center for Human-Computer Communication
Department of Computer Science and Engineering
Oregon Health & Science University, Portland, OR, USA
{andrea, pcohen}@cse.ogi.edu

Abstract
This chapter reports on a study whose goal was to investigate how people make use of gestures and spoken utterances while playing a videogame without the support of standard input devices. We deploy a Wizard of Oz technique to collect audio-, video- and body movement-related data on people's free use of gesture and speech input. Data was collected from ten subjects for up to 60 minutes of game interaction each. We provide information on preferential mode use, as well as the predictability of gesture based on the objects in the scene. The long-term goal of this ongoing study is to collect natural and reliable data from different input modalities, which could provide training data for the design and development of a robust multimodal recognizer.
Keywords: Multimodal speech-gesture data analysis and collection, interaction in immersive environment, object affordance and manipulation, Wizard of Oz study.
1. Introduction
Human-computer interaction in virtual environments has long been based on gesture, with the user's hand(s) being tracked acoustically, magnetically, or via computer vision. In order to execute operations in virtual environments, users often are equipped with datagloves, whose hand-shapes are captured digitally, or a tracked device that is equipped with multiple buttons. For a number of reasons, these systems have frequently been difficult to use. First, although the user's hand/arm motions are commonly called "gesture," the movements to
be recognized are typically chosen by the developer. Thus, rather than recognize people's naturally occurring movements, such systems require users to learn how to move "properly." Second, gestural devices have many buttons and modes, making it difficult for a naïve subject to remember precisely which button in a given mode accomplishes which function. Third, the 3D interaction paradigm usually derives from the 2D-based direct manipulation style, in which one selects an object and then operates upon it. Some systems have modelled the virtual environment interface even more strongly upon the WIMP (windows, icons, menus, pointing device) graphical user interface, providing users with menus that need to be manipulated in the 3D world. Unfortunately, it turns out to be very difficult to select objects and menu entries in 3D environments.

Various researchers have attempted to overcome these awkward interfaces in different ways. For example, Hinckley et al. [1994] gave users real-world models to manipulate, causing analogous actions to take place in the virtual environment. Stoakley et al. [1995] provided a miniature copy of the virtual world in the user's virtual hand, thereby allowing smaller movements in hand to have analogous results on the world itself. Fisher et al. [1986] developed an early multimodal 3D interface for simulated Space Station operations, incorporating limited speech recognition, as well as hand gestures using a VPL dataglove. Weimer and Ganapathy [1989] developed a prototype virtual environment interface that incorporated a VPL dataglove and a simple speech recognizer. Although only three gestures, all by the user's thumb, were recognized, and the speech system offered just a 40-word vocabulary, the authors remarked upon the apparent improvement in the interface once voice was added.

Based in part on this prior research, we hypothesize that multimodal interaction in virtual environments can ease the users' burden by distributing the communicative tasks to the most appropriate modalities. By employing speech for its strengths, such as asking questions, invoking actions, etc., while using gesture to point at locations and objects, trace paths, and manipulate objects, users can more easily engage in virtual environment interaction. In order to build such multimodal systems, we need to understand how, if at all, people would speak and/or gesture in virtual environments if given the choice. What would users do on their own, without being limited to the researchers' preferred gestures and language? Would gestures and language be predictable, and if so from what? Can the recognition of gesture and/or speech in virtual environments be improved by recognizing or understanding the input in another modality, as we find in 2D map-systems [Cohen et al., 1997; Oviatt, 1999]?

Regarding predictability of speech and gesture, we hypothesize that without instruction, people will manipulate manufactured objects in VE in the ways they were designed to be manipulated – using their affordances [Gibson, 1977;
Norman, 1988]. Given data indicating a user’s viewpoint on the object, and the degrees of freedom afforded by the object, a system should be able to predict how the user’s hand/arm will move. If the user can also speak, will they employ the same gestures during multimodal interaction as they employ using gesture alone? In order to answer these questions, and to provide a first set of data for training recognizers and statistically-based multimodal integration systems, we conducted a Wizard-of-Oz study of multimodal interaction with a simple, though realistically rendered, computer game.
2. Study
Ten volunteer subjects (nine adults, one 12-year-old child) interacted with the Myst™ III game played on a 2GHz Dell computer with an Nvidia GeForce 3 graphics card. None of the subjects had played Myst before; three had played video games regularly. Myst is a semi-immersive 2.5-dimensional game in which a user moves around a complex world containing both indoor and outdoor scenes. The user views the world through a (moderate) fish-eye viewport, which s/he can rotate 360 degrees, as well as tilt to see above and below. In Myst III, the user’s task is (partially) to travel around an island, rotating a series of beacons so that they shine on one another in a certain sequence, etc. Thus, the game involves navigation, manipulation of objects (doors, a knife switch, beacons, push-buttons, etc.) and search. The subjects wore a set of four 3D magnetic trackers attached to their head and dominant arm. They were told that they could interact with the game as they wished, and that the system could understand their speech and gestures.
2.1 Wizard of Oz Study
The classic method for studying recognition-based systems before the appropriate recognizers have been trained is to employ a Wizard of Oz paradigm [Oviatt et al., 1994]. In this kind of study, an unseen assistant plays the role of the computer, processing the user’s input and responding as the system should. Importantly, the response time should be rapid enough to support satisfactory interactive performance. In the present study, subjects were told that they would be playing the Myst III computer game, to which they could speak and gesture freely. The user chose where s/he wanted to stand. Subjects “played” the game standing in front of a 50” diagonal flat-panel plasma display in wide-view mode. They could and did speak and/or gesture without constraint. Unbeknownst to the subject, a researcher observed the subjects’ inputs, and controlled the game on a local computer, whose audio and video output was sent to the subject. The “wizard’s” responses resulted in scene navigation and object state changes, with a response time that
Figure 5.1. Example of experimental setup, utterance, gesture, and block figure reproducing the subject’s motions.
averaged less than 0.5 seconds. Since Myst III (and its predecessors) assumes the user is employing a mouse, it is designed to minimize actual gesturing, allowing only mouse-selection. Although occasionally the Wizard made errors, subjects received no explicitly marked recognition errors. A research assistant was present in the room with the subject, and would upon request give the subject hints about how to play the game, though not about what to say or how to gesture.
2.2 Equipment
To acquire the gesture data, the six-degree-of-freedom Flock of Birds (FOB) magnetic tracking device from Ascension Technology Corporation [2002] was used. We attached four sensors to the subject: one on the top of the head, one on the upper arm to register the position and orientation of the humerus, one on the lower arm for the wrist position and lower arm orientation, and finally one on the top of the hand. The last three sensors were aligned with each other whenever the subject stretched his or her arm to the side of the body, keeping the palm of the hand facing and parallel to the ground (see Figure 5.2). The data from the FOB are delivered via serial lines, one for each sensor. The four data streams can be processed in real time by a single SGI Octane machine employing the Virtual Reality Peripheral Network (VRPN) package [Taylor, 2001], which provides time-stamps of the data and
Figure 5.2. Arrangement of trackers on subject’s body.
distributes it to client processes. Because the FOB device uses a magnetic field that is affected by metallic objects, and the laboratory is constructed of steel-reinforced concrete, the data from the sensors are often distorted. As a result, the data are processed with a median filter to partially eliminate noise. A block figure plays back the sensor data, providing both a check on accuracy and distortion. Accuracy of the magnetic tracker in our environment was approximately 0.5 cm.
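Purely as an illustration of this filtering step, the following minimal sketch (Python) shows how a sliding-window median filter suppresses occasional metal-induced spikes in a single tracker channel; the window size, channel layout, and sample values are invented for the example and are not the parameters used in the study.

```python
from collections import deque
from statistics import median

class MedianFilter:
    """Sliding-window median filter for one tracker channel (e.g., x position in cm).

    Illustrative only: the window size and data layout are assumptions, not
    the settings of the actual FOB/VRPN processing pipeline."""

    def __init__(self, window_size=5):
        self.window = deque(maxlen=window_size)

    def update(self, sample):
        self.window.append(sample)
        return median(self.window)

# Example: smooth a noisy stream of x-positions from one sensor.
noisy_x = [10.0, 10.2, 55.0, 10.1, 10.3, 9.9, 48.0, 10.0]  # spikes caused by metal
f = MedianFilter(window_size=5)
smoothed = [f.update(x) for x in noisy_x]
print(smoothed)
```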
3. Data Analysis
The subject’s body motions were captured by the FOB, while the video recorded the subject’s view, and vocal interaction (see Figure 5.1). The speech and gesture on the video were transcribed, an example of which follows:
TRANSCRIPT FROM THE GAME
Bold = speech
# = location of gesture when not overlapping speech
(...) = hesitation/pause
XXX = undecipherable
[ ] = speech-gestural stroke overlapped event
Indications such as “08-01-42-25/44-11” = VCR time-code
08-01-42-25/44-11: # go across the bridge
[hand held palm open to point toward the bridge then hand used as cursor along bridge]

08-01-47-00: keep going
no gesture

08-01-50-03/57-27: [grab] this thing (...) just [grab it] and pull it down and see what happen
# reach for rim of telescope with hand, [close fist and pull hand from up to down], [one more time], [one more time]

08-01-59-29/07-04: # can I pull this thing ? (...) ah ahaa
# reach for rim of telescope, [close fist, pull hand from left to right circularly], [one more time]

08-02-10-00/15-16: ok [look at] look at this purple and see if there is anything to see
[move hand toward the purple ball as to push at it]

08-02-15-16/18-28: # no (...) [back]
move hand toward lens of telescope, [close fist to grab at rim of telescope and pull hand back toward the body], [move open hand again back toward the body]

08-02-19-04/26-25: ok # [turn it] again
# reach for rim of telescope, [close fist and pull hand from left to right circularly as to rotate rim of telescope], [one more time], [one more time]
The transcript includes both speech to the system and self-talk, but not requests for hints asked of the research assistant. Transcription took approximately two person-months.
3.1 Coding
The following categories were coded. For events that required explicit interaction with the system beyond causing the scene to rotate around, subjects were coded as using gesture-only, speech-only, or multimodal interaction. Numerous subcategories were coded, but this chapter only reports on the subset of gesture and multimodal interaction for which the user employed gesture “manipulatively” when interacting with an object. Interrater reliability for second-scoring of 18% of the multimodal data was 98%.

For ONLY GESTURE:

Consistent manipulative gesture: gesture used with the objective of changing the state of an object in the game — e.g., turning a wheel or
pressing a button. Such gestures are consistent if the movement matches the way the object operates.

Manipulative gesture, NOT consistent: as above, but the movement does not match the way the object operates.

For ONLY SPEECH:

Speech manipulative: speech involving any change of state of an object in the scene — e.g., standing in front of the wheel and saying “turn the wheel”, or “press the button” when in the elevator.

For SPEECH AND GESTURE TOGETHER:

Speech manipulative + consistent manipulative gesture: e.g., saying “turn the wheel” AND mimicking the gesture of turning a wheel.

Speech manipulative + gesture NOT manipulative: as above, but the gesture does not match the way the object functions.
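For readers who prefer a compact restatement of the scheme, the following sketch (Python) records these categories as a small data structure; the names and the example event are illustrative only and do not reproduce the coders’ actual instrument.

```python
from enum import Enum

class Modality(Enum):
    GESTURE_ONLY = "gesture-only"
    SPEECH_ONLY = "speech-only"
    MULTIMODAL = "speech+gesture"

class GestureUse(Enum):
    CONSISTENT_MANIPULATIVE = "manipulative, matches the way the object operates"
    INCONSISTENT_MANIPULATIVE = "manipulative, does not match the way the object operates"
    NOT_MANIPULATIVE = "not manipulative (e.g., deictic)"
    NONE = "no gesture"

# One coded interactive event, e.g., saying "turn the wheel" while mimicking turning it.
event = (Modality.MULTIMODAL, GestureUse.CONSISTENT_MANIPULATIVE)
print(event)
```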
4. Results
Of the 3956 “interactive events,” we totalled the use of gesture-only, speech-only, or multimodal interaction (see Figure 5.3). Subjects were classified as “habitual users” of a mode of communication if they employed that mode during at least 60% of the available times. As can be seen, 90% of the subjects were habitual users of speech (including multimodal interaction), and 60% of the subjects were habitual users of multimodal interaction. Subjects were classified as “habitual users” of the consistent manipulation of objects strategy if they gesturally manipulated the object at least 10 times in a given communication mode (using gesture-only or multimodally) and 60% of those times, their body motion was consistent with the affordances of that object (i.e., the ways humans would normally employ it). Figure 5.4 provides Myst III images that show some of those objects, whose affordances the reader can readily determine.

We examined the two cases separately: use of gesture-only and use of gesture within a multimodal event. Six subjects used gesture within their multimodal interaction manipulatively, rather than deictically, in order to change the state of an object. Of those six, five gestured in a fashion consistent with the object’s affordances, indicating that subjects using multimodal interaction manipulated digital “artefacts” according to the actions afforded by their design (Wilcoxon, p<0.03, Z = -1.89, one-tailed). The one subject who did not do so used a speak-and-point strategy (“turn the wheel” <point>). As for gesture alone, four subjects changed the state of an object with manipulative gesture used in a fashion consistent with the object affordances,
Figure 5.3. Subjects’ use of modalities.

Figure 5.4. Other objects that can be manipulated in Myst III.
whereas no subject used manipulative gestures inconsistent with the object’s affordances (Wilcoxon, p=0.023, Z = -2.000, one-tailed).
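The classification criteria used above can be made concrete with a short sketch (Python); the 60% and 10-event thresholds follow the text, while the data layout and function names are illustrative assumptions.

```python
def habitual_modality_user(events, modality, threshold=0.6):
    """True if the subject used `modality` in at least 60% of interactive events."""
    uses = sum(1 for e in events if e["modality"] == modality)
    return len(events) > 0 and uses / len(events) >= threshold

def habitual_consistent_manipulator(gesture_events, min_count=10, threshold=0.6):
    """True if the subject gesturally manipulated objects at least 10 times and
    at least 60% of those manipulations were consistent with the object's affordances."""
    manipulative = [e for e in gesture_events if e["manipulative"]]
    if len(manipulative) < min_count:
        return False
    consistent = sum(1 for e in manipulative if e["consistent_with_affordances"])
    return consistent / len(manipulative) >= threshold

# Invented example: a subject with 12 manipulative gestures, 10 of them consistent.
events = ([{"manipulative": True, "consistent_with_affordances": True}] * 10 +
          [{"manipulative": True, "consistent_with_affordances": False}] * 2)
print(habitual_consistent_manipulator(events))  # True
```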
4.1 Results from Questionnaire
The subjects filled out a post-test questionnaire (see Appendix) indicating, on scales of 1 (very low) to 10 (very high), their:

1. Experience with adventure games (x̄ = 2.3, σ = 1.8)
2. Immersive experience (x̄ = 4.2, σ = 2.5)
3. Involvement in the game (x̄ = 4.1, σ = 2.6)
4. Ranking of the interface (x̄ = 6.1, σ = 1.7)
5. System response latency (x̄ = 5.7, σ = 2.7)
In other words, the game itself was not particularly exciting, and subjects were only moderately involved in it. However, the speech/gesture interface received ratings slightly better than middling, while the latency inherent in a Wizard-of-Oz system did not make the interaction annoyingly slow.
4.2 Comments from Subjects
Subjects were asked, “Do you think there was a ‘preferred’ channel between speech or gesture for the interaction? Or did you feel both channels were equally effective? Please explain”. In response, we received the following comments, which often did not match the subjects’ performance:

“I feel that speech was easier for commands, since it is more economical in energy than gesture, and more precise. I grew tired of trying to motion precisely with my arms, and the time it took to turn. I would prefer to use speech over gesture, at least with the way gesture seemed to be implemented here.” Used speech-only 49%, gesture-only 45% of the time.

“I am confident that anything I was doing, I could do with either channel. Speech seemed more natural for some situations, and gestures for others, though.” A habitual user of speech-only interaction, using gesture 20% of the time, speech-only 74%.

“Preferred (channel) seemed to be speech. The movement felt like I was doing exercises.” A habitual user of speech (100%).

“I found voice commands worked better for me than the hand gestures.” A habitual user of multimodal interaction (62%), with no gesture-only interaction.

“I think the preferred channel is a gesture, with speech used to make clarifications.” A habitual user of multimodal interaction (83%).

“In the first few minutes, I realized that gesturing worked best for me. I always talk to my computer or my car while operating them, so that was more unconscious....I find it very involving to talk to the game...”. A habitual user of multimodal interaction (75%).
“I was under the distinct impression that gesture did nothing and the system was a speech-recognition one.” A habitual user of multimodal interaction (69%).

“I liked hands but used speech when confusing.” A habitual user of multimodal interaction (63%).
5. Discussion
Results show that if given the opportunity, most subjects (60%) would use multimodal interaction more than 60% of the time for interacting with the game; an additional 30% would use speech-only interaction 60% of the time or more, but only 10% would even use gesture-only half the time. Overall, subjects were found to employ gesture alone 14% of the time, speech alone 41% of the time, and used speech and gesture for 45% of their interactions. These latter results are somewhat skewed by subjects who employed only speech, since in order to navigate in the scene, they issued many more navigation commands (e.g., “go left”) than would be necessary if multimodal interaction were employed. Subjects’ opinions about which modalities were important were often belied by their actions. Some thought that gesture was the key modality, but used multimodal interaction habitually, while others thought speech was the essential modality, but also used multimodal interaction frequently. It would appear that having both available would suit just about all the subjects. When given the opportunity to use whatever gestures they wanted, most subjects who attempted to manipulate an artefact generally did so according to the affordances of those objects – subjects turned wheels, pulled down knife switches, pushed in doorbells, etc. Thus, a future virtual environment control program that detects that a manufactured object is in the scene should be able to predict how a person’s arm would be shaped and would move in order to manipulate that object properly. The gesture recognizer could then adapt to the scene itself, giving such gestures higher weight. As novices to these kinds of games, the subjects found Myst III to be a modestly immersive experience with a moderate degree of involvement. Unlike true 3D games, there is no avatar in the scene that represents the user, and thus one would expect a lesser degree of immersion than a full 3D virtual reality environment. Also, given that the subjects were not supplied with explicit objectives in playing the game, it is not surprising that their degree of involvement with the game was moderate. A number of subjects very much liked the speech-gesture interface, while most gave it mixed reviews.
6. Related Work
Most of the existing research on gestures has been performed by cognitive scientists who are interested in how people gesture, and the reasons why people
gesture [McNeill, 1993; Krauss, 1998; Goldin-Meadow, 1999; Kendon, 2000]. Various taxonomies of gesture have been offered, e.g., McNeill’s description of iconic, deictic, emblematic, and beat gestures [McNeill, 1993], which usefully inform scientists who build gesture-based systems and avatars [Cassell and Stone, 1999; Wilson et al., 1997]. Specific claims can also be useful to technologists. McNeill [1993] argues that speech and gesture derive from an internal knowledge representation (termed a “catchment”) that encodes both semantic and visual information. Our results tend to confirm this claim, in that the visual representation depicts the object’s affordances, which then determines the manipulative gesture. The present corpus could also be used to confirm Kendon’s [2000] claim that the stroke phase of gestures tends to co-occur with phonologically prominent words.

Quek et al. [2001] have provided a case-study of cross-modal cues for discourse segmentation during natural conversation by observing a subject describing her living space to an interlocutor. A comparative analysis of both video and audio data was performed to extract the segmentation cues while an expert transcribed the gestures. Then both the gestural and spoken data were correlated for 32 seconds of video. A strong correlation between handedness and high-level semantic content of the discourse was found, and baseline data on the kinds of gestures used was provided. Whereas this style of observational research is needed as a foundation, it needs to be combined with quantitative observational and experimental work in order to be useful for building computer systems. Wilson et al. [1997] employed the McNeill theory to motivate their research to distinguish bi-phasic gestures (e.g., beats) from more meaningful tri-phasic gestures that have preparatory, stroke, and retraction phases.

Early work by Hauptmann [1989] employed a Wizard of Oz paradigm and investigated the use of multimodal interaction for simple 3D tasks. It was found that people prefer to use combined speech and gesture interaction over either modality alone, and given the opportunity to do so in a factorially designed experiment, chose to use both 70% of the time (vs. 13% gesture only, and 16% speech only). That factorial study has the advantage that all subjects were exposed to all ways of communicating, whereas our more ecologically realistic study allowed users to develop their own ways of interacting. On the other hand, it also allowed users to become functionally fixed into their first successful way of operating.

Perhaps the most relevant work to ours is that of Sowa and Wachsmuth [2002], who employed a WOZ paradigm to collect subjects’ gestures as they attempted to describe a limited set of objects to a listener within a virtual construction domain. It was found that subjects’ hand shapes corresponded to features of the objects themselves. Based on this data, a prototype system was built that decomposes each gesture into spatial primitives and identifies
their interrelationships. The object recognition engine then employs a graph-matching technique to compare the structure of the objects and that of the gesture. This latter work differs from the results presented here in that the game we studied included the task of manipulating the object rather than describing it. Thus, we find people attempt to manipulate artefacts in the ways they were built to be manipulated, whereas Sowa and Wachsmuth found people used their hands in describing objects to indicate their shape. Clearly, the subjects’ goals and intentions make a difference to the kinds of gestures they use, whether in a multimodal or unimodal context.
7. Future Work
The next steps in this research are to analyze the syntax and semantics of the utterances in conjunction with the form and meaning of the gesture. To date, we have employed unification as the primary information fusion operation for multimodal integration [Cohen et al., 1997; Johnston et al., 1997]. We hypothesize that by using a feature-structured action representation that offers a “manner” attribute [Badler et al., 2000], whose value can itself be an action, and by representing the meanings of gestures as such actions, unification can again serve as the mechanism for information fusion across modalities. The analyses of utterance and gesture will be used to test this hypothesis. Given the identification of the same kind of gesture by the subjects to manipulate the same object, a gesture recognizer can then be trained [Corradini, 2002] and used in a multimodal architecture. We also will investigate to what extent mutual disambiguation of modalities [Oviatt, 1999] can be used to overcome recognition errors in a complete multimodal virtual environment system [Kaiser et al., 2003].
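As a rough illustration of how unification could fuse a spoken command with a gesture contributing a “manner” value, consider the following minimal sketch (Python). The feature structures, attribute names, and values are invented for the example; they are not the representation used in QuickSet or in [Johnston et al., 1997].

```python
def unify(a, b):
    """Recursively unify two feature structures (nested dicts); returns None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # incompatible values: unification fails
                result[key] = merged
            else:
                result[key] = b_val
        return result
    return a if a == b else None

# Speech: "turn the wheel" contributes an action with an unspecified manner.
speech = {"act": "rotate", "object": {"type": "wheel"}}
# Gesture: a circular hand motion at the wheel contributes the manner of the action.
gesture = {"object": {"type": "wheel"},
           "manner": {"act": "circular-motion", "direction": "clockwise"}}

fused = unify(speech, gesture)
print(fused)
# {'act': 'rotate', 'object': {'type': 'wheel'},
#  'manner': {'act': 'circular-motion', 'direction': 'clockwise'}}
```

The point of the sketch is simply that the same merge operation handles both modalities, so a gesture can fill an attribute (here “manner”) that speech leaves open, and conflicting values cause the fusion to fail.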
8. Conclusions
This chapter has provided initial empirical results, based on a Wizard of Oz paradigm, which indicate that people prefer to use multimodal interaction in virtual environments rather than speech or gesture alone. Furthermore, when they attempt to manipulate virtual representations of artefacts, they do so according to the objects’ affordances – the ways the objects were designed to be manipulated. We therefore conclude that virtual environment systems should be capable of multimodal interaction that would employ subjects’ natural gestures, and that such systems should in fact employ information about the objects currently in the user’s view to predict the kinds of gestures that will ensue. In the future, information about the users’ gaze could be employed to restrict still further the kinds of gestures that might be employed.
Acknowledgements

This research has been supported by the Office of Naval Research, under Grants N00014-99-1-0377, N00014-99-1-0380 and N00014-02-1-0038. We are thankful to Rachel F. Coulston for help in running the study, and to our volunteer subjects.
Appendix: Questionnaire MYST III - EXILE

PRE-SESSION:
1. Have you ever played MYST III: Exile - the sequel to the MYST and RIVEN series?
2. Have you ever played adventure or puzzle-like games?
3. How would you rank your experience in playing the game on a scale from 1 to 10? (1 = no experience at all - 10 = very experienced player)
4. Would you say that your current state of fitness is: as usual, you are sick, or both? Please explain.

POST-SESSION:
5. Did you like the game? Please explain.
6. Would you like to play again? Please explain.
7. What did you find difficult when playing? Please explain.
8. How would you rank the immersive effect of the game on a scale from 1 to 10? (1 = not immersive at all – 10 = amazingly immersive)
9. How would you rank your involvement in playing the game on a scale from 1 to 10? (1 = I played because I was requested to – 10 = I fully wanted to get to the end of the game)
10. How would you rank the interface on a scale from 1 to 10? (1 = extremely bad, it never did what I wanted to – 10 = amazingly good)
11. How would you rank the latency response of the interface on a scale from 1 to 10? (1 = when I entered a command the interface reacted too slowly – 10 = I did not even realize there was a latency between my commands and the response of the interface)
12. Would you have preferred to play using a mouse/joystick/trackball/keyboard instead of the way you played? Please explain.
13. Did you consciously decide when to use speech or gesture for playing? Please explain.
14. Did you have some hints concerning if and how to use gesture or speech, or did you just do what came naturally to you? Please explain.
15. Do you think there was a “preferred” channel between speech or gesture for the interaction? Or did you feel both channels were equally effective? Please explain.
16. Were the sensors attached to your body cumbersome? Did you feel restricted in your movements? Please explain.
17. Do you have any suggestions/criticism concerning the experiment?
References

Ascension Technology (2002). Flock of Birds. http://www.ascensiontech.com/.

Badler, N., Bindinganavale, R., Allbeck, J., Schuler, W., Zao, L., and Palmer, M. (2000). Parameterized Action Representation for Virtual Human Agents. In Cassell, J., Sullivan, J., Prevost, S., and Churchill, E., editors, Embodied Conversational Agents, pages 256–286. MIT Press.

Cassell, J. and Stone, M. (1999). Living Hand to Mouth: Psychological Theories about Speech and Gesture in Interactive Dialogue Systems. In Proceedings of the AAAI 1999 Fall Symposium on Psychological Models of Communication in Collaborative Systems, pages 34–42. North Falmouth, MA: AAAI Press.

Cohen, P. R., Johnston, M., McGee, D. R., Oviatt, S., Pittman, J., Smith, I., Chen, L., and Clow, J. (1997). QuickSet: Multimodal Interaction for Distributed Applications. In Proceedings of the Fifth Annual ACM International Multimedia Conference (Multimedia 1997), pages 31–40. Seattle, WA: ACM Press.

Corradini, A. (2002). Real-Time Gesture Recognition by Means of Hybrid Recognizers. In Sowa, T. and Wachsmuth, I., editors, Gesture and Sign Language in Human-Computer Interaction, pages 34–46. Berlin: Springer Verlag.

Fisher, S. S., McGreevy, M., Humphries, J., and Robinett, W. (1986). Virtual Environment Display System. In Proceedings of the ACM Workshop on Interactive 3D Graphics, pages 77–87. Chapel Hill, NC: ACM Press.

Gibson, J. J. (1977). The Theory of Affordances. In Shaw, R. and Bransford, J., editors, Perceiving, Acting, and Knowing, pages 67–82. Hillsdale, NJ: Lawrence Erlbaum Associates.

Goldin-Meadow, S. (1999). The Role of Gesture in Communication and Thinking. Trends in Cognitive Science, 3(11):419–429.

Hauptmann, A. G. (1989). Speech and Gestures for Graphic Image Manipulation. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 241–245. New York: ACM Press.

Hinckley, K., Pausch, R., Goble, J. C., and Kassell, N. F. (1994). Passive Real-World Interface Props for Neurosurgical Visualization. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 452–458.

Johnston, M., Cohen, P. R., McGee, D., Oviatt, S., Pittman, J. A., and Smith, I. (1997). Unification-Based Multimodal Integration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL), pages 281–288, Madrid, Spain. ACL Press.

Kaiser, E., Olwal, A., McGee, D., Benko, H., Corradini, A., Li, X., Cohen, P. R., and Feiner, S. (2003). Mutual Disambiguation of 3D Multimodal Interaction in Augmented and Virtual Reality. In Proceedings of the 5th International Conference on Multimodal Interfaces (ICMI-PUI), pages 12–19, Vancouver, B.C., Canada.

Kendon, A. (2000). Language and Gesture: Unity or Duality? In McNeill, D., editor, Language and Gesture, pages 47–63. Cambridge, UK: Cambridge University Press.

Krauss, R. M. (1998). Why do we Gesture When we Speak? Current Directions in Psychological Science, 7:54–59.

McNeill, D. (1993). Hand and Mind: What Gestures Reveal about Thought. Chicago: University of Chicago Press.

Norman, D. A. (1988). The Design of Everyday Things. New York, NY: Currency/Doubleday.

Oviatt, S. L. (1999). Mutual Disambiguation of Recognition Errors in a Multimodal Architecture. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 576–583. New York: ACM Press.

Oviatt, S. L., Cohen, P. R., and Wang, M. Q. (1994). Toward Interface Design for Human Language Technology: Modality and Structure as Determinants of Linguistic Complexity. Speech Communication, 15(3-4):283–300.

Quek, F., McNeill, D., Bryll, R., Duncan, S., Ma, X.-F., Kirbas, C., McCullough, K. E., and Ansari, R. (2001). Gesture and Speech Multimodal Conversational Interaction. Technical report, Electrical Engineering and Computer Science Department, University of Illinois, Chicago.

Sowa, T. and Wachsmuth, I. (2002). Interpretation of Shape-Related Iconic Gestures in Virtual Environments. In Sowa, T. and Wachsmuth, I., editors, Gesture and Sign Language in Human-Computer Interaction, pages 21–33. Berlin: Springer Verlag.

Stoakley, R., Conway, M. J., and Pausch, R. (1995). Virtual Reality on a WIM: Interactive Worlds in Miniature. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 265–272. ACM Press.

Taylor, R. M. (2001). VRPN: A Device-Independent, Network-Transparent VR Peripheral System. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, pages 55–61. New York: ACM Press.

Weimer, D. and Ganapathy, S. K. (1989). A Synthetic Visual Environment with Hand Gesturing and Voice Input. In Bice, K. and Lewis, C., editors, Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 235–240. New York: ACM Press.

Wilson, A. D., Bobick, A. F., and Cassell, J. (1997). Temporal Classification of Natural Gesture and Application to Video Coding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 948–954. New York: IEEE Press.
Chapter 6

ANALYSING MULTIMODAL COMMUNICATION

Repair-Based Measures of Human Communicative Coordination

Patrick G. T. Healey, Marcus Colman and Mike Thirlwell
Interaction, Media, and Communication Research Group, Department of Computer Science, Queen Mary, University of London, UK
{ph, marcus, miket}@dcs.qmul.ac.uk

Abstract: There are few techniques available to inform the design of systems to support human-human interaction. Psycholinguistic models have the potential to fill this gap; however, existing approaches have some conceptual and practical limitations. This chapter presents a technique, based on the conversation analytic model of breakdown and repair, for modality- and task-independent analysis of communicative exchanges. The rationale for the approach is presented and a protocol for coding repair is described. The potential of this approach for analysing multimodal interactions is discussed.
Keywords: Communication, conversation analysis, repair, evaluation.
1. Introduction
Communication is central to many human activities, even for tasks and technologies that are ostensibly individual, see e.g., [Suchman, 1987; Heath and Luff, 2000; Hutchins, 1995; Nardi and Miller, 1991]. It underpins the co-ordination of routine activities and people’s responses to unexpected contingencies. Ethnomethodological studies of workplace interaction have documented how technologies not directly designed to support communication are nonetheless recruited in unanticipated ways to facilitate communicative coordination, e.g., [Heath and Luff, 2000; Hughes et al., 1992; Bowers et al., 1995]. Technologies such as text-based messaging and chat tools, integrated conferencing systems and large-scale collaborative virtual environments are
explicitly designed to support human communication. However, current approaches to the design and evaluation of these technologies focus on the analysis of the human-system interaction. There are few techniques that directly address the analysis of human-human interaction. Existing conceptual and empirical analyses of human communication should have much to contribute to design and evaluation. However, there are few established techniques either for identifying the communication related requirements of a given activity or for evaluating the impact different technologies have on the effectiveness of communication.

Despite the elegance of the empirical observations, workplace studies have a problematic relationship to design, see e.g., [Hughes et al., 1992; Button and Dourish, 1996]. They require intensive effort and special analytical skills. They are also retrospective in character and do not provide a basis for systematic comparisons across different technologies or situations. The models and techniques developed in the psycholinguistic tradition are ostensibly more promising. They explicitly aim to quantify communicative phenomena and develop explanations that support comparative and predictive generalisations. The earliest applications of this approach used simple structural measures of interaction such as turn-taking, interruptions, backchannels, and gaze, see e.g., [O’Conaill et al., 1993; O’Malley et al., 1996]. Although these measures support direct comparison between some aspects of, for example, video-mediated interaction and face-to-face exchanges, the coding categories are coarse and do not take account of communicative function. For example, the category of ‘interruptions’ is sometimes formulated as the occurrence of overlapping speech where there has been no signal that a speaker is relinquishing the floor. This conflates accidental overlap, disruptive or competitive interventions and cooperative interventions such as collaborative completions. As a result the distribution of, e.g., numbers of turns, lengths of turn, and interruptions is consistent with a number of possible interpretations, cf. [Anderson et al., 1997]. The finding that, for example, video-mediated communication leads participants to use more turns and words than they would when face-to-face is thus difficult to interpret. It might indicate either that video adds something to the interaction or that participants are working to compensate for some limitations it introduces.

This chapter begins by reviewing the strengths and limitations of two psycholinguistic approaches to the analysis of human-human interaction: Dialogue Games analysis and the Collaborative Model of Dialogue. We argue that both approaches have important limitations as techniques for analysing multi-modal dialogue. An alternative approach is presented that focuses on breakdown and repair: the detection and resolution of communication problems. We propose that this can provide more robust, comparative measures of the effectiveness of human-human interaction across tasks and modalities. In
order to turn this approach into a practical tool it is first necessary to develop a coding protocol that supports reliable identification of repair phenomena across different modalities. We present a protocol and test it against a corpus of attested examples of conversational repair. We then go on to sketch some of its potential applications to the analysis of multi-modal interactions.
1.1 Dialogue Games Analysis
Problems with the interpretation of simple structural analyses have been addressed to some extent by techniques which analyse communicative function directly. For example, Anderson et al. [1997] compared the performance of subjects carrying out a map drawing task under different media conditions. This task has a reliable dialogue coding system developed for it which characterises the functional structure of the dialogues, analysing each utterance as a move (e.g., instruct, explain, check, clarify, query-w, align) and structured sequences of moves as dialogue games [Kowtko et al., 1991]. Anderson et al. [1997] have shown how this approach can be used to compare the dialogue moves used to carry out the task either face-to-face, using audio only or using video-mediated communication. Unlike the structural analyses, this provides a means of isolating the communicative functions that are preserved or impeded by different media.

Although it supports systematic comparisons between media, functional analysis also has limitations. The coding system was designed to exhaustively classify utterance function in a particular collaborative task, the map task. As a result the move types are tailored to the transactional character of the task. Although it has been shown to generalise to some other information exchange tasks, it is unclear whether the coding scheme is adequate for qualitatively different kinds of exchange such as competitive negotiation. A practical limitation is that because the coding system is exhaustive, every utterance is analysed and coded. This is labour intensive and has the consequence that the sensitivity of the coding system is traded off against its coverage, see [Carletta et al., 1996]. An additional concern is that the functional analysis does not discriminate between incidental exchanges and those necessary for the activity at hand. All exchanges are coded and all contributions are given equal weight in the analysis.
1.2 The Collaborative Model of Dialogue
A second approach to modelling communication that also improves on simple structural analyses is the collaborative model of dialogue (CM) developed by Clark and co-workers [Clark, 1996; Clark and Wilkes-Gibbs, 1986; Clark and Schaefer, 1989]. This focuses on the process through which people build up their common ground during interaction. The basic principle is that the
parties to an interaction only consider an utterance, or other communicative act, to be successful where some positive evidence of its acceptance or ‘grounding’ has been obtained. The grounding principle is modified according to the types of evidence and the degree of effort required to secure mutual-understanding in a given case. Two of the most important qualifications are that: (a) interlocutors always seek to reduce the joint, as opposed to individual, effort necessary to successfully ground a communicative act; and (b) interlocutors will attempt to ground a contribution only up to a criterion level (the grounding criterion), which is adjusted according to circumstances.

This framework has been used to characterise the properties of different communicative media and modalities. Clark and Brennan [1991] identify a set of eight constraints on grounding that derive from the signal characteristics of different communicative media (Copresence, Visibility, Audibility, Cotemporality, Simultaneity, Sequentiality, Reviewability and Revisability). These constraints alter the ease with which evidence of grounding can be provided and change the costs associated with different grounding techniques (Formulation costs, Production costs, Reception costs, Understanding costs, Start-up costs, Delay costs, Asynchrony costs, Speaker Change costs, Fault costs, Repair costs). During interaction, individuals must make a trade-off between the different types of action they can undertake in order to ground a particular contribution and their cost in a particular medium. For example, where turn taking costs are high individuals may invest more in the construction of each utterance and less in attempts at concurrent feedback.

The CM has also been applied to the analysis of system feedback. The CM distinguishes a number of levels at which an action is considered complete. For example, an utterance may be perceived but not understood, or it may be understood but the action it proposes is not undertaken. These action levels are ordered according to the principle that feedback which indicates completion at a higher level presupposes completion at a lower level: if I comply with your request then this is evidence that I have also heard and understood it. Consequently, the current state of the common ground with respect to communicative action can vary depending on the degree of grounding that has been secured. The maintenance of context in interaction with a system can be supported by giving feedback which signals the level of grounding that a particular action has achieved [Brennan, 1988]. The level of feedback given can be modulated by the risks associated with possible misunderstandings.

Although it generalises to a variety of tasks and communication modalities, there are some important limitations to these applications of the CM model. One problem is that it provides only a limited analysis of situations in which contributions fail to secure acceptance. Where this occurs there is a general expectation that some repair will ensue, for example through reformulation of the contribution in a way that is acceptable to the addressee(s). A number
of possible types of reformulation, e.g., alternative descriptions, instalment descriptions and trial references, are distinguished, but the pattern of choice amongst these types, and their relationship to the success of repairs, is not addressed in the model. Although it is explicitly acknowledged that processes of conversational repair play a critical role in sustaining the mutual-intelligibility of interactions [Brennan, 1988; Clark, 1996], no specific mechanism has so far been developed for dealing with this.

A more practical limitation on the application of the CM to the analysis of interaction is its relative underspecification. As Clark and Brennan [1991] note, the media constraints and costs invoked to explain particular patterns of communication in different media are heuristic. The list of costs and constraints is neither exhaustive nor exclusive and there is no systematic means of quantifying them or calculating the possible trade-offs involved. In order to do this some quantification of the communicative effort invested in an interaction would be required. Without a means of comparing the grounding criteria being employed in different cases, i.e. of estimating the grounding criterion, the rationale behind particular patterns of communicative response cannot be determined. For example, a pattern of small, instalment, contributions may indicate a situation in which turn costs are low or a situation in which the grounding criterion is high.
2. Breakdown and Repair
The position developed in this chapter is that the problems with functional and CM analyses of communication identified above can be addressed by focusing on the analysis of breakdown and repair in communication. The distinguishing feature of this approach is that it is concerned only with those parts of an interaction in which communicative trouble is encountered. The approaches described above depend on positive characterisations of communicative success; on analysing how things go right. The present approach, by contrast, focuses on analysing how things go wrong, and how people respond to this. A detailed framework for characterising these situations has been developed in the conversation analytic (CA) tradition [Schegloff et al., 1977; Schegloff, 1987; Schegloff, 1992]. Before describing the potential application of this framework to the analysis of mediated interactions, it is important to set out the basic CA repair framework and to clarify the concept of communicative problem it invokes.

Two basic aspects of the CA model are of particular relevance to the present chapter: the structure of repairs and how specific they are. Structurally, the CA repair framework distinguishes between three things: who initiates a repair, where in the turn taking structure it occurs, and the subsequent trajectory of the repair to completion. For example, self-initiation occurs where the speaker
identifies a problem with one of their own turns in a conversation. Other-initiation occurs in situations in which someone signals a problem with another participant’s turn. The point at which the problem is addressed, the ‘repair’, is also classified in terms of ‘self’ and ‘other’. Self-repair occurs where the person who produced the problematic utterance also addresses the problem, regardless of whether they signalled that it was problematic. Other-repair occurs where someone addresses a putative problem with someone else’s turn. Four positions are distinguished in which a problem can be signalled or addressed. First position repair occurs in the turn in which a problem occurs, second position repair takes place in the next turn that occurs as a response to the problem turn. Third position repair-initiation occurs in the next turn that occurs as a response to the second position and so on [Schegloff, 1992; Schegloff, 1987].

Repair initiations are also distinguished according to the specificity with which they localise a problem. Schegloff et al. [1977] proposed an ordering of other-initiation types according to their power. The most general kind of initiation is a “huh?” or “what?” which signals that the utterer has a problem but gives almost no clues about its precise character. This is followed, in order of increasing specificity, by a ‘wh’ question, such as “who?” or “what?”. In this case the nature of the signalled problem is clearer and could potentially be associated with a particular sub-part of the problematic turn. More specific still are a partial repeat plus a ‘wh’ question or just a partial repeat. This type of reprise clarification provides information about the specific element(s) of the original turn that caused the problem. The strongest form of repair initiation in Schegloff et al.’s ordering is to offer a full paraphrase or reformulation prefaced by “you mean . . . ”. In this situation the recipient of the turn has successfully recognised and parsed what was said but wishes to test a possible interpretation of it with the original speaker. Schegloff et al. [1977] propose that there is a preference for using the strongest or most specific type of initiation available in any given case. This is supported by the observation that weaker initiations are interrupted by stronger initiations and, where several initiations occur in sequence, they increase in specificity.
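To summarise the structural distinctions just introduced, the following sketch (Python) records them as a small data type; it is purely illustrative and is not part of the coding protocol described later in this chapter.

```python
from dataclasses import dataclass
from enum import Enum

class Party(Enum):
    SELF = "self"     # the producer of the problematic contribution
    OTHER = "other"   # another participant

class Position(Enum):
    P1 = 1  # within the problem turn itself
    P2 = 2  # in the next turn responding to the problem turn
    P3 = 3  # in the turn responding to the second-position turn

@dataclass
class RepairEvent:
    initiated_by: Party   # who signals the problem
    repaired_by: Party    # who addresses it
    position: Position    # where in the turn-taking structure it occurs

# e.g., an other-initiated self-repair completed in third position:
example = RepairEvent(Party.OTHER, Party.SELF, Position.P3)
print(example)
```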
2.1 A Repair-Based Analysis
The CA repair framework can be adapted to the comparative analysis of technologically mediated communication. As noted, this framework focuses only on those junctures where something goes wrong in an interaction. The CA approach to breakdown and repair does not involve appeal to a model of what is ‘actually’ being communicated or a theory of error. For example, Schegloff [1992] states that: “adequacy of understanding and intersubjectivity is assessed not against some general criterion of meaning or efficacy (such as convergent paraphrase) and not
by ’external’ analysts, but by the parties themselves vis-à-vis the exigencies of the circumstances in which they find themselves” [Schegloff, 1992, page 1338].
The question of whether a breakdown in communication occurs because of some objectively verifiable misunderstanding is explicitly suspended [Garfinkel, 1967]. The focus instead is on analysing the situations which the participants treat as problematic, independently of whether there is any ‘real’ problem. The analysis is not concerned with whether a turn was, from an observer’s viewpoint, correct or accurate but only with whether the participants treat it as intelligible. This gives rise to some potentially counter-intuitive analyses. For example, a request for clarification that does not signal a problem with the intelligibility of a preceding utterance is not a repair in this sense. In addition a complaint about audibility when using a computer conferencing system does not count as a communication problem unless there is evidence that the complaint itself was not understood. This orientation has two potential practical benefits. Firstly, the analysis of communicative coordination is separated from analysis of the task domain. It is not necessary for the analyst to develop a theory of what task people are engaged with or how it is carried out. The patterns of repair type and trajectory can be analysed independently of the transaction involved. In contrast to some functional schemes of analysis such as dialogue games, the analysis should thus be applicable to a variety of different kinds of communicative interaction. The second potential benefit is that it promises to improve the sensitivity of the analysis by concentrating attention on those exchanges in which communicative problems occur. This claim trades on the assumption that, relative to other kinds of exchange, the frequency with which problems are signalled and addressed is moderated by their perceived importance to the coherence of the interaction. This assumption is based on the observation in the CA literature that turns which initiate repair are avoided if possible, especially those that signal problems with another participant’s contribution [Schegloff et al., 1977]. A focus on repair should thus help to filter out those exchanges which participants consider incidental to their purposes from those which they consider essential.
2.2 Adapting the Repair Model
In order to exploit the CA model in the comparative analysis of interaction it is necessary to develop a coding protocol that operationalises and specifies criteria for identifying the instances of repair phenomena. The CA repair model was developed in a tradition in which statistical generalisations are often specifically eschewed, see e.g., [Schegloff, 1992; Schegloff, 1993]. For example, Schegloff [1993] argues that the attempt to contrast categories of ‘sociability’ of participants by reference to “laughter per minute” and “backchannels
per minute” is flawed [Schegloff, 1993, page 104]. One concern is that ‘per minute’ is not a valid unit of conversation. Laughter and other phenomena are only relevant at certain points or ‘environments’ in a conversation. Frequent laughter outside these conversational contexts is more typically treated as inappropriate than as a sign of sociability. Schegloff [1993] also highlights the difficulty in defining or operationalising conversational phenomena. For example, the form of a contribution such as “Yeah” does not in itself determine if it is used as a backchannel or to signal a change of speaker, cf. [Levinson, 1983]. These concerns are less problematic, we believe, for developing a systematised analysis of repair. Firstly, Schegloff notes that the environments of occurrence for repair initiations are unrestricted: “In fact, because nothing can be excluded in principle from the class ‘repairable’ [Schegloff et al., 1977, page 363], such repair initiation by the recipient of some talk appears to be the only type of turn in conversation with an unrestricted privilege of occurrence; it can in principle occur after any turn at talk. In that respect, then, its “environments of relevant possible occurrence” are well defined.” [Schegloff, 1993, page 115].
Prima facie this applies as much to the producer of the talk as it does to the recipient and therefore can be extended to include both ‘self’ and ‘other’ initiation. Secondly, the form and trajectory of repairs have been more extensively studied, and are better understood, than many other conversational phenomena. Consequently, this provides a promising domain for the definition of criteria that support the reliable identification of instances of repair, see [Schegloff, 1993, page 115].
2.3 The Repair Protocol
The Repair Protocol presented here draws on the procedures for signalling and addressing communicative problems described by CA and adapts them to the purposes of developing a tool that supports reliable identification of the occurrence of repair phenomena in communication under different conditions and media. The complete repair protocol is shown in Figure 6.1. It has been designed to be useable by people who have no specific knowledge of the body of conversation analytic research that it draws on. It is constructed so that an analyst takes each contribution to an interaction and tests it against a number of questions. The questions employ a yes/no format and are formulated to be as simple as possible. Because a given turn may contain more than one repair-related event, the protocol is applied iteratively to a turn until no further repairs are detected.
Key to Figure 6.1: P1 = Position 1; P2 = Position 2; P3 = Position 3; SI = Self-initiated; OI = Other-initiated; SR = Self-repair; OR = Other-repair; NTRI = Next Turn Repair Initiator.
Figure 6.1. The Repair Protocol. (Flowchart of yes/no questions applied to each contribution; the process is repeated until no new instances of repair are detected.)
The status of a contribution is often determined only by the response it receives. As a result, the protocol also operates partly retrospectively. For example, some position two repair initiations are only classified as such after the response to them has been analysed. Although built on empirical studies of verbal interactions, the protocol is designed to be modality neutral. It aims to capture repair phenomena across a variety of modalities including, for example, graphical and gestural interaction. As a result it refers to initiators rather than speakers and, following Clark and Schaefer [1989], to contributions, and modifications to contributions, rather than turns. More importantly, it has also been designed to avoid reliance on clearly identifiable sequences of turns. This is important for situations such as text- and whiteboard-based interaction where turn sequence is, in principle, more flexible than in verbal interaction. The protocol does require, however, that analysts can identify what, if any, preceding parts of the interaction a contribution may be addressed to. It is an assumption of this approach that, for example, gestural and graphical analogues of the verbal repair phenomena can be identified.

There are several other departures from the existing literature on repair that should be noted. Firstly, there is no attempt to isolate 4th position repairs because they are too rare to be useful for making quantitative comparisons. Secondly, the protocol does not currently distinguish between third turn and third position repairs [Schegloff, 1987]. This is an issue we hope to address in the future. Thirdly, it does not capture embedded repairs where a correction or adjustment is made without any explicit signalling of a problem [Jefferson, 1982]. From a practical perspective this phenomenon is difficult to capture but it is also unclear whether it reflects breakdown of the kind that is of interest here.
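To convey the control flow of the protocol in Figure 6.1, the following highly simplified sketch (Python) walks a single contribution through a few of its yes/no questions. It covers only a subset of the branches (enough to classify T2-T4 in the worked example that follows), omits the retrospective confirmation step, and is a paraphrase of the figure rather than the protocol itself.

```python
def classify_contribution(ask):
    """Apply a few of the protocol's yes/no tests to one contribution.

    `ask(question)` stands in for the analyst's judgement and returns True/False.
    Returns a list of provisional repair labels; in the full protocol the tests
    are re-applied until no new instances are found.
    """
    labels = []
    if ask("Does the initiator edit, amend, or reprise part of their contribution "
           "before another participant responds to it?"):
        if ask("Is the edit introduced to change the meaning of the contribution?"):
            labels.append("P1, SI, SR (Formulation)")
        else:
            labels.append("P1, SI, SR (Articulation)")
    elif ask("Is this contribution introduced to edit, amend, or reprise a previous "
             "contribution by the initiator?"):
        if ask("Was this edit requested or intentionally prompted by another participant?"):
            labels.append("P3, OI, SR")  # and the earlier prompt is confirmed as a P2 NTRI
        else:
            labels.append("P3, SI, SR")
    elif ask("Is this contribution introduced to propose repetition or revision of "
             "another participant's contribution?"):
        if ask("Does the initiator also provide a proposed revision?"):
            labels.append("P2, OI, OR")
        else:
            labels.append("P2, NTRI (provisional, pending the response)")
    return labels

# For T2 ("What?") in the example below, an analyst would answer "yes" only to the
# 'propose repetition or revision' question, yielding the provisional P2 NTRI label.
```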
2.3.1 Example application of the repair protocol. To illustrate the application of the protocol we take an established example of position two other-initiation and other-repair from the conversation analytic literature (turns are labelled T1-T5).

T1: Lori: But you know single beds are awfully thin to sleep on.
T2: Sam: What?
T3: Lori: Single beds. They’re-
T4: Ellen: You mean narrow?
T5: Lori: They’re awfully narrow, yeah.
(from Schegloff et al. [1977, page 378])

Starting at the top of Figure 6.1 we apply the first test to T1. In this case there is no evidence of changes (edit/amend/reprise) before Sam responds so the answer is ‘NO’ and we proceed down the right hand branch of the
protocol. The next box asks whether T1 is introduced to accept or confirm another participant’s interpretation of one of the initiator’s previous contributions. The answer to this, and the next two questions, is ‘NO’ and we exit the first iteration through the protocol.

We then move on to T2. The answer to the first question is again ‘NO’ and, following the right branch of the protocol, the answer to the second question is ‘NO’. The answer for the next question: "Is this contribution introduced to propose repetition or revision of another participant’s contribution?" is ‘YES’ since the ‘wh’ question signifies a request for some sort of clarification. The next test is then: "Does the initiator also provide a proposed revision?" Here the answer is ‘NO’ and, since Lori (the initiator) does subsequently respond to this request, T2 is provisionally classified as a position 2 (P2) next turn repair initiator (NTRI).

We then proceed to T3. Again, going through the protocol in sequence leads to the second question on the right hand branch, "Is this contribution designed to edit, amend or reprise a previous contribution by the initiator?" In this case, the phrase "single beds" is a reprise of part of the earlier contribution by this speaker, so the answer is ‘YES’. As this reprise was prompted by another speaker, the answer to the following box is also ‘YES’. This labels the contribution as P3 OI SR, and confirms that the prompt was P2 NTRI. Note that it is only after T3 has been analysed that the classification of T2 is confirmed. This illustrates the sense in which the protocol is partly ‘backward-looking’.

T4 is produced by a new participant, Ellen. Working through the protocol we encounter the question: "Is this contribution introduced to propose repetition or revision of another participant’s contribution?" The answer is ‘YES’ but, unlike T2, T4 also offers a candidate reformulation of Lori’s T1. As a result the answer to the next question: "Does the initiator also provide a proposed revision?" is also ‘YES’. T4 is therefore categorised as P2 OI OR (position 2, other-initiated, other-repair). Note that although T4 occurs several turns after T1 it is still classified as ‘position two’ because it is oriented to T1 and not to any of the intervening turns. T5 confirms the previous interpretation (Ellen’s) of Lori’s earlier contribution (T1) and this leads to an ‘END’ box on the right hand side of the repair protocol.
2.3.2 Validity and reliability. The reliability and validity of the protocol have been initially assessed by analysing a corpus of 65 example repair sequences containing 76 instances of repair initiations/repair drawn from Jefferson [1982], Schegloff [1982, 1987, 1992, 2000] and Schegloff et al. [1977]. The examples were chosen to cover as many phenomena as possible. They were extracted and stripped of the markers of their original coding. Two of the authors, who were not involved in preparing the corpus, then independently applied a draft version of the protocol to these examples.
Table 6.1. Validity By Category.

Target Phenomenon    Number of Instances    Percent Identified
P1 SI SR             8                      88%
P1 TS R              2                      75%
P1 Fail              2                      50%
P2 R                 11                     87%
P2 RI                28                     71%
P2 OI Fail           3                      67%
P3 SI SR             16                     75%
P3 OI SR             6                      67%
Overall              76                     75%
Table 6.2. Reliability Broken Down By Category.

Target Phenomenon    Instances Judge 1    Instances Judge 2    Percent Agreement
Position 1           83                   75                   63%
Position 2           64                   66                   72%
Position 3           39                   45                   62%
Self                 122                  120                  63%
Other                64                   66                   73%
Repair               145                  153                  63%
Initiation           41                   33                   50%
Because the examples drawn from the literature contain more repair phenomena than those they are used to illustrate, this provides a way to estimate both validity (by comparison with the original analyses) and reliability (by comparison between judges). Overall validity was good, with on average 75% of the original examples being assigned to the same category as in the original paper. Table 6.1 breaks down the results by category. Note that for some phenomena the number of instances is very low, and confidence in the effectiveness of the protocol analysis for those categories is consequently reduced. Inter-judge reliability was calculated as percentage agreement in each of the categories: Position, Self-Other, Repair and Initiation. This was lower, at 64%; however, it should be noted that this way of calculating reliability cannot account for levels of agreement on contributions that did not involve any repair (it is unclear how to count 'unrepaired' events). As a result the effective level of reliability is higher than these figures suggest. The scores for each category are given in Table 6.2. Examination of a confusion matrix showed that levels of
agreement were reduced primarily by situations in which one judge identified something that the other did not. Only in relatively few situations were the same repair events assigned to different categories. The primary source of confusion was between classifications of events as position 1 formulation problems and position 1 articulation problems. Although there is a large body of research which has applied CA analyses to a variety of examples, to our knowledge this is the first time the validity and reliability of repair categorisation have been assessed. These initial results suggest that this framework can be used to carry out systematic, replicable analyses of corpora of interactions.
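As a concrete illustration of the reliability figures, the sketch below computes per-category percentage agreement between two judges. It is only one plausible reading of the calculation; the chapter does not spell out how the two codings were aligned, so the helper simply treats an event coded by one judge but not by the other as a disagreement for that category.

```python
# One plausible way to compute per-category percentage agreement between two judges.
# Each coding maps an event id to the category label that judge assigned.
from collections import defaultdict

def percent_agreement(codings_judge1, codings_judge2):
    per_category = defaultdict(lambda: [0, 0])  # label -> [agreements, comparisons]
    for event_id in set(codings_judge1) | set(codings_judge2):
        l1 = codings_judge1.get(event_id)
        l2 = codings_judge2.get(event_id)
        for label in {l1, l2} - {None}:
            per_category[label][1] += 1
            if l1 == l2:
                per_category[label][0] += 1
    return {label: 100.0 * agree / total for label, (agree, total) in per_category.items()}

# Toy data: the shared events agree, but each judge also codes an event the other misses.
judge1 = {"e1": "P2 NTRI", "e2": "P1 SI SR", "e3": "P3 OI SR"}
judge2 = {"e1": "P2 NTRI", "e2": "P1 SI SR", "e4": "P2 OI OR"}
print(percent_agreement(judge1, judge2))
```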
3. Analysing Communicative Co-ordination
Repair per se is not necessarily an indicator of lower communicative coherence. It could, for example, reflect greater efforts to understand exactly what is being said or reflect a more complex exchange. Instead of making global comparisons of the frequency of repair, the present proposal is to use the structure and distribution of specific types of breakdown and repair to provide indices of communicative co-ordination. A systematic test of this application of the protocol is the subject of ongoing work. This section illustrates the potential of this approach to analyse multi-modal communication and suggests some of the potential measures of communicative co-ordination it could provide. The simplest index that this approach can provide is a measure of the difficulty of producing a contribution in a particular medium. In conversation a significant proportion of utterances show problems with articulation. Analogous problems arise in text chat where typos are frequently a problem, and in drawing where problems can arise in drawing complex shapes or fine detail. In the protocol problems of this kind are classified as Articulation problems and their frequency of occurrence provides a basic index of the difficulty of externalising a contribution in a particular medium. Because the protocol captures only those ‘typos’ or ‘disfluencies’ that the initiator of a contribution chooses to correct, it reflects the participant’s estimate of the impact of the articulation problem on the effectiveness of the interaction. It can also be used to index the grounding criteria (see above) that an individual employs. For example, if we hold task and media constant the frequency of Articulation repairs should be proportional to the effort participants are investing in making their contributions clear. A second index that the protocol can provide is a measure of the effect of a medium on the difficulty of formulating a contribution. In the protocol this is captured by the position 1, self-initiated, self-repairs (P1,SI,SR). These are repairs in which the initiator makes modifications that, unlike typos, alter the possible meaning of their contribution during production. For example a
referring expression may be rephrased by replacing "he" with "she", or part of a drawing may be erased and redrawn before being presented as complete. It might be expected that media which produce a persistent representation of a contribution, e.g., text chat or email, should, all things being equal, lead to more Formulation repairs than those that produce a transient representation, for example speech. Alternatively, if we hold the medium constant then the frequency of Formulation repairs should vary as a function of the cognitive load the task places on participants. Perhaps the most interesting potential measures are those which promise to directly index the communicative load associated with different media and tasks. One way in which this could be addressed is by assessing the frequency of, for example, position two and position three repair initiations under different task and media conditions. All things being equal, if a particular medium alters the intelligibility of interaction in some situation then this should be reflected in the frequency of repair initiations. More subtle distinctions can also be made. Arguably, self-initiated, self-repair in position three is indicative of high communicative co-ordination since it depends on sensitivity to a recipient's interpretation of one of the initiator's preceding utterances. Measures like this could provide for characterisation of the relative communicative 'transparency' of different media. A further interactional measure can be derived from the analysis of the specificity of the problems encountered. The ability of interlocutors to efficiently localise and deal with a problem provides an index of their communicative coherence. One possibility provided directly by the CA framework is to exploit the ranking of initiation types according to their power to locate a 'repairable', as discussed above. However, the notions of paraphrase and 'wh' question do not generalise to non-verbal interaction. If it is assumed that, all things being equal, more severe problems will require more extensive repairs, then analysis can focus on the amount of the preceding material that is replaced. For verbal exchanges this could be indexed by the proportion of words altered or amended in the repair. For graphical exchanges it can be indexed by the proportion of a drawing or sketch that is revised.
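To make the proposed indices concrete, the sketch below shows how they might be computed once a corpus has been coded with the protocol. The data layout and the exact category labels are our own assumptions; the chapter describes the measures only in prose.

```python
# Sketch of the co-ordination indices discussed above, over a (hypothetical) list of
# coded contributions such as {"id": "t7", "codes": ["P1 SI SR"]}.

def coordination_indices(contributions):
    n = max(len(contributions), 1)

    def rate(code):
        return sum(1 for c in contributions if code in c["codes"]) / n

    return {
        "articulation_repair_rate": rate("P1 Articulation"),  # difficulty of externalising a contribution
        "formulation_repair_rate": rate("P1 SI SR"),          # difficulty of formulating a contribution
        "p2_initiation_rate": rate("P2 NTRI"),                # how often recipients signal problems
        "p3_self_repair_rate": rate("P3 SI SR"),              # sensitivity to the recipient's interpretation
    }

def repair_specificity(original_words, repaired_words):
    """Proportion of the original contribution's words altered in the repair
    (the verbal variant of the 'amount of material replaced' index)."""
    altered = sum(1 for w in original_words if w not in repaired_words)
    return altered / max(len(original_words), 1)
```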
4. Discussion
This chapter has proposed that a repair-based analysis can provide useful operationalisations of several aspects of communicative co-ordination. We have shown that repairs can be systematically identified, and we have described some ways in which these phenomena can be used to assess the relative effectiveness of human-human interaction. The discussion of specific measures and their application to multi-modal interaction is, however, speculative. The present claim is that a repair-based approach can provide a more effective analysis of communicative coordination
than existing applications of psycholinguistic techniques. One important gap in the current protocol is that it does not directly capture problems with the handover of turns. Informal observations of remote multi-modal communication have suggested that this is an important class of communication problem, particularly where there are more than two participants. An interesting corollary of the present analysis is that communicative coherence should be enhanced by situations or technologies that make visible as much of the structure of each individual's contribution as possible. This follows from the observation that repairs are initiated and effected by manipulating the structure of preceding contributions. To give a concrete example, in current implementations of text chat it is difficult to signal a problem with preceding turns. It is difficult both to identify the preceding turn itself and to identify which elements of the turn were problematic. Users usually have to repeat the problematic material together with some identifier in order to effect a repair initiation. Shared whiteboards, by contrast, support much simpler devices. For example, users can circle or underline problematic contributions directly. The original contribution is thus more easily edited on a whiteboard than in text chat. Media or environments that allow users to manipulate the structure of each other's contributions should, on this view, provide more effective support for co-ordinating understanding.
Acknowledgements This work was partly supported by AvayaLabs, US under the "Mixed Mode Communication Project" in the Department of Computer Science at Queen Mary University of London. This chapter supersedes an earlier, unpublished, version presented at the 1999 AAAI fall symposium "Psychological Models of Communication in Collaborative Systems". We gratefully acknowledge Anjum Khan, Ioannis Spyradakis and Sylvia Wilbur for their contributions to the development of this work.
References

Anderson, A. H., O'Malley, C., Doherty-Sneddon, G., Langton, S., Newlands, A., Mullin, J., Fleming, A. M., and van der Velden, J. (1997). The Impact of VMC on Collaborative Problem Solving: An Analysis of Task Performance, Communicative Process and User Satisfaction. In Finn, K. E., Sellen, A., and Wilbur, S. B., editors, Video-Mediated Communication, pages 133-156. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Bowers, J., Button, G., and Sharrock, W. (1995). Workflow from Within and Without: Technology and Co-operative Work on the Print Industry Shop Floor. In Proceedings of the Fourth European Conference on Computer-Supported Co-operative Work, pages 51-66. New York: ACM Press.
Brennan, S. (1988). The Grounding Problem in Conversations with and through Computers. In Fussell, S. R. and Kreuz, R. J., editors, Social and Cognitive Approaches to Interpersonal Communication, pages 201-225. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Button, D. and Dourish, P. (1996). Technomethodology: Paradoxes and Possibilities. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), pages 19-26. New York: ACM Press.

Carletta, J., Isard, A., Isard, S., Kowtko, J., Doherty-Sneddon, G., and Anderson, A. (1996). The Reliability of a Dialogue Structure Coding Scheme. Computational Linguistics, 23(1):13-31.

Clark, H. H. (1996). Using Language. Cambridge: Cambridge University Press.

Clark, H. H. and Brennan, S. (1991). Grounding in Communication. In Resnick, L., Levine, J., and Teasley, S., editors, Perspectives on Socially Shared Cognition, pages 127-150. American Psychological Association.

Clark, H. H. and Schaefer, E. F. (1989). Contributing to Discourse. Cognitive Science, 13:259-294.

Clark, H. H. and Wilkes-Gibbs, D. (1986). Referring as a Collaborative Process. Cognition, 22:1-39.

Garfinkel, H. (1967). Studies in Ethnomethodology. Englewood Cliff: Prentice Hall.

Heath, C. and Luff, P. (2000). Technology in Action. Cambridge: Cambridge University Press.

Hughes, J., Randall, D., and Shapiro, D. (1992). Faltering from Ethnography to Design. In Proceedings of the Conference on Computer Supported Cooperative Work (CSCW), pages 115-122.

Hutchins, E. (1995). How a Cockpit Remembers its Speeds. Cognitive Science, 19:265-288.

Jefferson, G. (1982). On Exposed and Embedded Correction in Conversation. In Button, G. and Lee, J. R., editors, Talk and Social Organisation, pages 86-100. Clevedon: Multilingual Matters.

Kowtko, J. C., Isard, S. D., and Doherty, G. M. (1991). Conversational Games within Dialogue. In Caenepeel, M., Delin, J. L., Oversteegen, L., Redeker, G., and Sanders, J., editors, Proceedings of the DANDI Workshop on Discourse, HCRC, Edinburgh, UK.

Levinson, S. C. (1983). Pragmatics. Cambridge: Cambridge University Press.

Nardi, B. and Miller, J. (1991). Twinkling Lights and Nested Loops: Distributed Problem Solving and Spreadsheet Development. International Journal of Man-Machine Studies, 34:161-184.
O'Conaill, B., Whittaker, S., and Wilbur, S. (1993). Conversations over Video Conferences: An Evaluation of the Spoken Aspects of Video-Mediated Communication. Human-Computer Interaction, 8:389-428.

O'Malley, C., Langton, S., Anderson, A., Doherty-Sneddon, G., and Bruce, V. (1996). Comparison of Face-to-Face and Video-Mediated Interaction. Interacting with Computers, 8(2):177-192.

Schegloff, E. A. (1982). Recycled Turn Beginnings: A Precise Repair Mechanism in Conversation's Turn-Taking Organisation. In Button, G. and Lee, J. R., editors, Talk and Social Organisation, pages 70-85. Clevedon: Multilingual Matters.

Schegloff, E. A. (1987). Some Sources of Misunderstanding in Talk-in-Interaction. Linguistics, 25:201-218.

Schegloff, E. A. (1992). Repair after Next Turn: The Last Structurally Provided Defense of Intersubjectivity in Conversation. American Journal of Sociology, 97(5):1295-1345.

Schegloff, E. A. (1993). Reflections on Quantification in the Study of Conversation. Research on Language and Social Interaction, 26(1):99-128.

Schegloff, E. A. (2000). When 'Others' Initiate Repair. Applied Linguistics, 21(2):205-243.

Schegloff, E. A., Jefferson, G., and Sacks, H. (1977). The Preference for Self-Correction in the Organization of Repair in Conversation. Language, 53(2):361-382.

Suchman, L. (1987). Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge: Cambridge University Press.
Chapter 7

DO ORAL MESSAGES HELP VISUAL SEARCH?

Noëlle Carbonell and Suzanne Kieffer
LORIA, 615 rue du Jardin Botanique, 54600 Villers-lès-Nancy, France
{Noelle.Carbonell, Suzanne.Kieffer}@loria.fr

Abstract
A preliminary experimental study is presented that aims at eliciting the contribution of oral messages to facilitating visual search tasks on crowded displays. Results of quantitative and qualitative analyses suggest that appropriate verbal messages can improve both target selection time and accuracy. In particular, multimodal messages comprising a visual presentation of the isolated target together with absolute spatial oral information on its location in the displayed scene seem most effective. These messages also received top rankings from most subjects.
Keywords: Multimodal interaction, multimedia presentations, visual search, spatial oral messages, speech and graphics, usability experimental study.
1. Context and Motivation

1.1 Multimodality: State of the Art
Numerous forms of speech-based input multimodality have been proposed, implemented and tested. Combinations of speech with gestural modalities have been studied extensively, especially combinations of speech with modalities exploiting new input media, such as touch screens, pens, data gloves, haptic devices. Both usability and implementation issues have been considered; see, among others, [Oviatt et al., 1997]1 or [Robbe et al., 2000]2 for the first category of issues, [Nigay and Coutaz, 1993; Stock et al., 1997] for the second category.
1 On speech and pen.
2 On speech and finger gestures on a touch screen.
Contrastingly, speech combined with text and graphics has motivated only a few studies. As an output modality, speech is mostly used either as a substitute for standard visual presentation modes (cf. phone services) or for supplementing deficiencies in visual exchange channels. Recent research efforts have been focusing on two main application domains:
– providing blind or partially sighted users with easy computer access, e.g. [Grabowski and Barner, 1998; Yu et al., 2001]; and
– implementing appropriate interaction facilities in contexts of use where access to standard screen displays is difficult or even impossible. This is the case, for instance, in contexts where only small displays are available (e.g. PDAs and wearable computers), or in situations where the user's gaze is involved in other activities (e.g. while driving a car); see, for instance, [Baber, 2001] concerning the first class of situations, and [de Vries and Johnson, 1997] concerning the second one.
However, there is not yet, at least to our knowledge, a substantial amount of scientific work on the integration of speech into the output modalities of standard user interfaces, that is, interfaces intended for standard categories of users using standard application software in standard environments and contexts of use; graphical user interfaces (or GUIs) implementing direct manipulation design principles [Shneiderman, 1983] are still prevailing. Published research on output forms of multimodality including speech amounts to usability studies of the role of speech comments in multimedia presentations, such as [Faraday and Sutcliffe, 1997], and contributions to the automatic generation of multimedia presentations, cf. [André and Rist, 1993; Maybury, 1993; Maybury, 2001]. This lack of interest in output forms of multimodality on the part of the research community may result, at least partly, from the fact that, although multimedia and multimodality refer to different concepts, these terms are often used as synonyms, especially when applied to system outputs. See [Maybury, 2001] for precise definitions of both terms.
1.2 Motivation and Objectives
The definition of the information content and semantics of multimedia presentations is commonly viewed as the responsibility of experts in the specific application domain considered. It is seldom viewed as lying within the scope of research on human-computer interaction. Therefore, the frequent assimilation of multimodality to multimedia may explain why the design of appropriate multimodal system responses has raised but little interest in the user interface research community, especially from an ergonomic angle, save for studies focused on specific categories of users or specific contexts of use. However, if standard users are offered speech facilities together with other input modalities,
it is mandatory that the system responses are not limited to visual messages. Communication situations where one interlocutor can speak and the other cannot are rather unusual. Research is then needed on the usability and software issues concerning the generation of appropriate multimodal system responses in standard human-computer interaction environments, and for standard user categories, including the general public. The main objective of the preliminary experimental study presented here is to contribute to scientific advances in this research area, by addressing one of the major usability issues relating to the generation of effective oral system messages, namely: how to design oral messages that facilitate the visual exploration of crowded displays? In particular, how to design messages that effectively help users to locate specific graphical objects on such displays? Perceptual and cognitive processes involved in visual search vary according to whether the goal of the search is familiar to the user or not. Familiarity here refers to the user's visual familiarity with the graphical object they are looking for. Resorting to linguistic deictics and visual enhancements of graphical targets is a solution which seems "natural". However, it is no more effective than the sole visual enhancement of the target. Another approach is to implement a human-like animated embodiment of the system, to visualise it on the screen and to endow it with a pointing device; see, for instance, the PPP persona3, which impersonates a car dealer and uses a pointing stick for attracting the user's attention to the assets of the currently displayed car [André, 1997]. However, the contribution of personae to the usability and effectiveness of human-computer interaction is still unclear, cf. [van Mulken et al., 1999]. Further testing is required in order to determine the usefulness of animated graphical system embodiments in graphical user interfaces. These reasons explain why we chose to focus first on assessing whether oral messages including spatial information actually facilitate the visual exploration of complex displays, especially the localization of graphical targets, in the context of standard human-computer interaction and environments. We selected visual search as the experimental task for the following reasons. It is one of the few human visual activities, besides reading, that have motivated a significant amount of psychological research; see, for instance, [Henderson and Hollingworth, 1998], [Findlay and Gilchrist, 1998] or [Chelazzi, 1999].
3 PPP means 'Personalized Plan Based Presenter'. See also Candace Sidner's robot, a penguin which can point at locations on large horizontal displays with its beak [Sidner and Dzikovska, 2002].
The design of numerous computer applications may benefit from a better knowledge of this activity, especially applications for the general public such as:
– online help for current interactive application software (for instance, novice users interacting with present graphical interfaces are often confused by the increasing number of toolbars and icons displayed concurrently);
– map reading environments (cf. geographical applications) and navigation systems in vehicles;
– data mining in visualisations of very large data sets; see, for instance, the complex hyperbolic graph visualisations proposed in [Lamping et al., 1995], or the treemap representations in [Fekete and Plaisant, 2002], and [Shneiderman, 1996; Card and Mackinlay, 1997; Card et al., 1999] for a general overview of visualisation techniques and their use.
The methodology and experimental set-up of the experiment presented here are described in the next section, together with the underlying scientific hypotheses. Then, quantitative and qualitative results are presented and discussed in Section 3. Future research directions stemming from these results are described in the general conclusion, cf. Section 4.
2. Methodology and Experimental Set-Up

2.1 Overall Experimental Protocol
To assess the potential contribution of oral spatial information to facilitating visual search, we designed a preliminary experiment with:
– target presentation mode as independent variable;
– target search+selection time and target selection accuracy as dependent variables.
Eighteen subjects were to locate and select visual targets in thirty-six complex displayed scenes, using the mouse. They were requested to carry out target localization and selection as fast as they could. Colour displays only were used. Each scene display was preceded by one out of three possible presentations of the corresponding target:
– display of the isolated target at the centre of the screen during three seconds (visual presentation or VP);
– oral characterization of the target plus spatial information on its position in the scene (oral presentation, OP);
– simultaneous display of the visual and oral presentations of the target (multimodal presentation, MP).
Table 7.1. Overall task set-up.

Group    VP    OP    MP        Group    VP    OP    MP
G1       P1    P2    P3        G4       P2    P1    P3
G2       P3    P1    P2        G5       P1    P3    P2
G3       P2    P3    P1        G6       P3    P2    P1

Gi: group of 3 subjects (3 × 6 = 18 subjects). Pi: set of 12 visual scenes (3 × 12 = 36 scenes).
These three sets of thirty-six dual stimuli defined three experimental situations, namely the VP, OP and MP conditions (108 pairs of stimuli in all: 36 scenes × 3 presentation modes, each pair consisting of a presentation of the target, visual, oral or multimodal, followed by the display of the scene including this target). In the MP condition, the visual and oral presentations used in the VP and OP conditions respectively were presented simultaneously. Subjects were randomly split up into three groups, so that each subject processed twelve pairs of stimuli per condition, and each pair of stimuli was processed by six subjects. In order to neutralize possible task learning effects, the processing order was counterbalanced inside each group of subjects as follows: VP-OP-MP (three subjects) and OP-VP-MP (three subjects). All subjects performed the MP condition last. The size of groups in usability studies or cognitive ergonomics experimental studies seldom exceeds six subjects, most likely because analysing the behaviours of subjects performing realistic tasks in realistic environments is indeed a costly undertaking; for instance, Ahlberg, Williamson and Shneiderman [1992] report an experimental evaluation of three different user interfaces (meant for exploring information spaces) which also involved eighteen subjects split up into three groups of six subjects each, one group per user interface. The overall set-up is summed up in Table 7.1. Experimental design choices were mainly motivated by the intent to assess the soundness of the three following working hypotheses, which use the VP condition as the reference situation:
A. Multimodal presentations of targets will reduce selection times and error rates in comparison with visual presentations.
B. Oral presentations of targets will also improve accuracy, compared to visual presentations.
C. The type of spatial information included in oral target presentations will influence selection times and error rates. In particular, absolute and relative spatial indications will prove more effective than references to a priori knowledge (see Subsection 2.3 for illustrations of these concepts), and absolute spatial information will be more effective than relative spatial indications.
These hypotheses are based on common sense reasoning, in the absence, at least to our knowledge, of earlier published results and models stemming from experiments comparable to ours, that is, involving similar tasks and interaction environments. Targets were unique graphical objects in the scenes including them. However, since they were presented out of context in the VP presentation mode, they might easily be confused with other graphical objects in the scene, while unambiguous linguistic designations of targets would prevent detection errors due to such confusions from occurring. It then follows that accuracy would be higher in the OP and MP conditions than in the VP condition. As regards selection times, we assumed that subjects would use the spatial information included in oral messages, and that this information would enable them to focus visual search for the current target on a rather limited area of the scene. Scenes being complex and displayed on a rather large screen5, we should then observe appreciably shorter selection times in the OP and MP conditions than in the VP condition. Concerning hypothesis C, absolute spatial information implies a one-step target localization process, while relative spatial information induces a two-step visual search; the latter is then less effective than the former as regards selection time, cognitive workload and accuracy. As for references to a priori knowledge, they involve more complex cognitive processes than the two previous types of spatial information; in addition, cultural knowledge varies greatly among users.
In the remainder of the section, further information is given on: the criteria used for selecting visual scenes and targets (2.2); the structure and information content of oral messages (2.3); subjects' profiles (2.4); the experimental set-up (2.5); and the methodology adopted for analysing subjects' results (2.6).
5 Scenes included several scores of graphical objects and were displayed on a 21-inch screen.
2.2 Scene Selection Criteria
Most visual scenes were taken from currently available Web pages in order to provide subjects with realistic, attractive task environments. They were classified according to criteria stemming from Bernsen's taxonomy of output modalities [Bernsen, 1994], our aim being to investigate the possible influence of the type of visual scenes displayed on target selection times and accuracy. Our classification was derived from the graphical categories in Bernsen's taxonomy as follows. We focused on static graphical displays exclusively6, on the ground that the localization and selection of moving targets in animated visual presentations is a much more complex activity than the selection of still targets in static visual presentations. Issues relating to the exploration of animated visual scenes will be addressed at a later stage in our research. We established two main classes of static presentations:
– Class 1 comprises displays of structured or unstructured collections of symbolic or arbitrary graphical objects, such as maps, flags, graphs, or geometrical forms (cf. classes 9, 11, 21, 25 in Bernsen's taxonomy);
– Class 2 includes displays of realistic objects or scenes, namely photographs or naturalistic drawings figuring complex real objects (e.g. monuments) or everyday life environments, such as views of rooms, town or country landscapes, etc. (cf. class 10 in Bernsen's taxonomy).
Half of the thirty-six visual scenes belonged to class 1, and the other half to class 2. Class 1 and class 2 scenes in each of the three subsets described in Subsection 2.1 (cf. Table 7.1) were randomly ordered. Targets were objects or component parts of complex objects (cf. the complex real objects in class 2). They were chosen according to the following criteria. An acceptable target was a unique, definite graphical object that could be designated unequivocally by a short, simple verbal phrase. Although all targets were unique, some of them could easily be confused (visually) with other objects in the scene. In order to avoid task learning effects, target visual properties7 and position were varied from one scene to another.

6 Cf. the five types of static graphical presentations in Bernsen's taxonomy, namely classes 9, 10, 11, 21, 25.
7 These properties mainly include colour, shape, orientation and size (within the limits of the fixed-size target presentation box).
2.3 Message Structure and Content
All messages included a noun phrase meant to designate the target unequivocally. For instance, "the pear" refers to the target unequivocally in the realistic scene reproduced in Figure 7.1.
Figure 7.1. Basket with fruit (class 2 picture). Oral message: "On the left of the apple, the pear."
For any target, we chose the shortest, simplest noun phrase that could characterize it, with respect to other elements in the scene, without ambiguity or redundancy. For instance, one of the scenes represented geometrical figures and included several squares. The target we chose was the smallest square and the only pink one. We referred to one of these features, the colour, to appropriately reduce the scope of the substantive "square" in the oral message; "the pink square" is a noun phrase that refers to the target without ambiguity or redundancy. On the other hand, the use of the substantive "pear" is sufficient for referring unambiguously to the target in the scene reproduced in Figure 7.1. We experimented with three types of spatial information in the verbal phrases referring to target locations, using an ad hoc classification inspired by the taxonomy presented in [Frank, 1998]:
– absolute spatial information (ASI), such as "on the left/right" or "at the top/bottom";
– relative spatial information (RSI), for instance "on the left of the apple" (cf. Figure 7.1);
– implicit spatial information (ISI), that is, spatial information that can easily be inferred from common a priori knowledge and the visual context; for instance, it is easy to locate and identify the Mexican flag among twenty other national flags from the simple message "The Mexican flag." if the scene represents a planisphere and national flags are placed inside the boundaries of the matching countries (cf. Figure 7.2).
Figure 7.2. National flags (class 1). Oral message: “The Mexican flag.”
Messages included one or two spatial phrases, according to scene complexity; phrase pairs comprised one or two types of spatial information, namely ASI+RSI or ASI+ISI. Careful attention was paid to the choice of spatial prepositions [Burhans et al., 1995]. In order to make the assessment of hypothesis C possible, all messages had the same syntactical structure, so that information content was the only factor pertaining to the design of messages that could influence localization times and selection errors. The following structure, which emphasizes spatial information, was adopted for most messages (some ISI messages including no spatial phrase):
[Spatial phrase] + Noun phrase (designation)
This structure was preferred to the usual one (Noun phrase + Spatial phrase), based on the assumption that information in messages would be more effective if it was presented in the same order as it would be interpreted and used by the user.
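As an illustration only, the sketch below composes messages following this template from a designating noun phrase and optional spatial phrases; the function and its name are ours and not part of the authors' experimental software.

```python
# Illustrative composition of oral messages with the
# [Spatial phrase] + Noun phrase (designation) structure described above.

def compose_message(noun_phrase, spatial_phrases=()):
    parts = list(spatial_phrases) + [noun_phrase]
    text = ", ".join(parts) + "."
    return text[0].upper() + text[1:]

# RSI message of Figure 7.1 and ISI message of Figure 7.2:
print(compose_message("the pear", ["on the left of the apple"]))  # On the left of the apple, the pear.
print(compose_message("the Mexican flag"))                        # The Mexican flag.
```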
2.4 Subjects' Profiles
As this study involved a restricted number of subjects (18) and was a first attempt at validating hypotheses A, B and C, we defined strong constraints on subjects’ profiles in order to reduce inter-individual diversity, especially regarding task performance, and limit the number of factors that might influence
subjects' performances. To achieve homogeneity, we selected 18 undergraduate or graduate students in computer science with normal eyesight8 and ages ranging from 22 to 29. Thus, all participants were expert mouse users with similarly quick motor reactions, familiar with visual search tasks, and capable of performing the chosen experimental tasks accurately and rapidly.

8 Save for one subject who was slightly colour-blind.
2.5 Experimental Set-Up
First, the experimenter presented the overall experimental set-up. Then, after a short training session (6 target selections in the VP situation), each subject processed 12 scenes per situation, in the order VP-OP-MP or OP-VP-MP. Before each change of condition (i.e. before each change in target presentation mode), the experimenter explained the new specific set-up to the subject. For each visual scene:
– The target was first presented, during three seconds:
  – either visually, in a fixed-size box in the centre of the screen;
  – or orally (together with a blank screen);
  – or orally and visually, simultaneously, that is, resorting to synergetic multimodality [Coutaz and Caelen, 1991] redundantly [Coutaz et al., 1995].
– Then, a button appeared in the centre of the screen together with a written message requesting the subject to click on the button for launching the display of the scene. Therefore, at the beginning of each target selection step, the mouse was positioned in the centre of the screen, making it possible to compare subjects' selection times.
– The next target was presented as soon as the subject had clicked on any object in the currently displayed scene.
At the end of the session, subjects had to fill in a questionnaire requesting them to rate the difficulty of each task using a four-degree scale (ranging from "very easy" to "very difficult"). The session ended with a debriefing interview.
2.6 Analysis Methodology
Quantitative results comprise:
– average localization + selection times; and
– error (i.e. wrong target selections) counts or percentages;
computed over all subjects and scenes, as well as per class of scenes and per type of oral message. We tested the statistical significance of these quantitative results whenever a sufficient number of samples was available. Qualitative analyses of subjects' performances, especially comparisons between the numbers of target selection errors in the VP, OP and MP conditions, provided useful information for defining further research directions. In order to elicit the possible factors at the origin of selection errors, scenes and targets were characterized using the following criteria:
Scene characterization:
– complexity (according to the number of displayed objects); and
– for class 1 scenes only, visual structure (e.g. random layout of objects; tree, crown or matrix structures; ...).
Target characterization:
– position on the screen (centre, left, ...);
– visual saliency;
– familiarity versus unfamiliarity (oddness);
– ambiguity (i.e. the number of possible visual confusions with other graphical objects in the scene).
Quantitative and qualitative results are presented and discussed in the next section.
3. Results: Presentation and Discussion

3.1 Quantitative Results
Results were computed over 34 scenes; two scenes (both in class 1) had to be excluded from quantitative and qualitative analyses by reason of technical incidents. Therefore, the corpus of experimental data we actually used comprised the results of 612 visual search tasks, that is 204 tasks per condition (VP, OP, or MP), each scene having been processed in each condition by 6 subjects. Statistical tests, mainly t-tests, were performed on these performance data (successes/failures and execution times), especially on the three sets of data collected in the three conditions.
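The chapter does not say which variant of the t-test was used; as a sketch only, the comparison reported in Table 7.2 could be run on per-task selection times with Welch's unequal-variance t-test from SciPy, as below, where the short lists stand in for the real 204-value samples per condition.

```python
# Sketch of a two-sample t-test of the kind reported in Table 7.2 (Welch's variant
# assumed; the illustrative lists replace the real per-task data).
from scipy import stats

selection_times_vp = [2.1, 3.0, 2.6, 2.9, 3.4]   # illustrative values, not the real data
selection_times_op = [3.8, 4.5, 2.9, 6.0, 3.1]

t, p = stats.ttest_ind(selection_times_op, selection_times_vp, equal_var=False)
print(f"t = {t:+.2f}, p = {p:.4f}")
# Per-task 0/1 accuracy scores can be compared in the same way for the error counts.
```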
Table 7.2. Subjects' performances per target presentation mode.

Target presentation mode    Number of errors    Average selection time (sec.)    Standard deviation (sec.)
VP                          31                  2.83                             1.70
OP                          14                  3.92                             3.50
MP                          8                   2.70                             1.93

Target presentation mode    Number of errors        Average selection time (sec.)
VP versus OP                t=-2.70, p=0.007        t=+3.79, p=0.0002
VP versus MP                t=-3.94, p<0.0001       t=-0.70, p=0.4852
OP versus MP                t=-1.31, p=0.189        t=-4.20, p<0.0001
Upper half: best results are in bold type, and lowest ones are underscored. Lower half: significant statistical results are in bold type.
3.1.1 Global analysis.
Description. Concerning the order of conditions (VP then OP, versus OP then VP), comparisons between subjects' performances in each of the two groups yielded no significant inter-group differences. These results indicate that no perceptible task learning effect occurred in the course of the experiment. The absence of any such effect is not surprising, due to the low number of tasks in each condition (i.e. 12) and the brief duration of the overall session (about 10 min.). Therefore, learning effects may rightly be excluded from the factors to be considered in the interpretation of the quantitative and qualitative results of our analyses, even as regards the MP condition, although it was performed last by all subjects. As for selection accuracy, oral messages proved much more effective than visual target presentations, as shown by comparisons between the VP and OP conditions (cf. Table 7.2). The total number of errors in the OP condition decreased by 55%. However, selection was slower in the absence of prior visualizations of isolated targets, that is, in comparison with the VP and MP conditions. Average selection time in the OP condition increased by over 38% compared to the VP condition. These differences are statistically significant. Table 7.2 also shows that multimodal presentations of targets reduced selection times compared to oral presentations, and error rates compared to visual presentations, both results being statistically significant. These results are in keeping with those observed for selection errors. To conclude, the results presented in Table 7.2 confirm hypothesis B, whereas hypothesis A is confirmed for selection error numbers only.
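The two percentages quoted above follow directly from the error counts and mean selection times in Table 7.2; the short check below makes the arithmetic explicit.

```python
# Worked check of the figures quoted above, from Table 7.2.
errors_vp, errors_op = 31, 14
time_vp, time_op = 2.83, 3.92

print(round(100 * (errors_vp - errors_op) / errors_vp))  # 55 (% fewer errors in OP than in VP)
print(round(100 * (time_op - time_vp) / time_vp))         # 39 (just over 38% slower in OP than in VP)
```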
Interpretation and discussion. As regards accuracy, spatial indications and target verbal designations included in oral messages may have reduced the frequency of visual confusions, the former by reducing the scope of visual search for the target, the latter by preventing confusions between the target and graphical objects of similar visual appearance in the scene. The significantly longer average selection time in the OP condition, together with a much higher standard deviation, may be explained by the fact that subjects were unfamiliar with the visual search tasks they had to carry out during the OP condition, this situation being rather unusual compared to the VP and MP situations, which occur frequently in everyday life. Therefore, the higher variability of selection times in the OP condition may be assumed to reflect the high inter-individual diversity of cognitive abilities and learning processes. However, the longer selection times observed in the OP condition may be explained more satisfactorily by the intrinsic differences between the tasks subjects had to perform in the OP and VP conditions. Visual search for a visually known graphical target is a much less complex task than searching for a graphical object that matches a given verbal specification, even if only a part of the scene needs to be explored thanks to spatial verbal information. Bieger and Glock [1986] observed similar effects on the performances of subjects in an experiment that aimed at comparing the efficiency of text and graphics for presenting various types of information in instructional material. In particular, concerning spatial information (namely the final position and orientation of elements in the workspace in these assembly tasks), Bieger and Glock found that subjects who were given textual presentations of spatial information made fewer task execution errors than those who were given graphical presentations of the same information; on the other hand, the latter completed the prescribed tasks faster. Nevertheless, further comparisons between this study and ours would be meaningless, due to important design differences between the two experimental protocols: in Bieger and Glock's study, the tasks subjects had to carry out were procedural assembly tasks (versus target selection in our experiment), and the modalities considered were text and graphics (versus speech and graphics). Concerning subjects' performances in the MP condition, comparisons with their performances in the two other conditions suggest that, in the MP condition, they succeeded in making the most of the information provided by each modality, thus compensating for the weaknesses of each unimodal presentation mode. A likely interpretation is that they achieved:
– easy disambiguation of possible confusions between targets and other objects of similar appearance in the scene, thanks to spatial and denominative verbal information; and
– rapid identification of targets thanks to visual information, matching based on visual characteristics being faster than identification based on abstract properties or stereotyped mental representations, which involves complex decision-making processes.
Surprisingly enough, spatial verbal information, which according to hypothesis A should have accelerated target localization, did not actually contribute to reducing selection times significantly, or so it seems. This finding, together with the longer selection times observed in the OP condition, may be explained within the framework of current visual perception models, which assume that eye movements are less influenced by cognitive (top-down) processes than by visual stimuli (bottom-up processes) during visual exploration tasks [Henderson and Hollingworth, 1998]. On the other hand, visual exploration in search of a graphical object matching a verbal specification is a task that combines complex, hence slow, cognitive processes with perceptual activities. These differences may explain why longer selection times were observed in the OP condition than in the VP and MP conditions, and why spatial oral information did not noticeably reduce selection times in the MP condition compared to the VP condition. This interpretation of the noticeable differences observed between average selection times in the three conditions fits in with Rasmussen's cognitive model [Rasmussen, 1986], provided that the search for a visually known target is assimilated to a skilled or automatic activity, and the search for a visually unknown target from a verbal specification of its characteristics to a problem-solving activity. The model then predicts that the selection time for the known target will be shorter than that for the unknown target, based on the assumption that automatic or skilled responses to stimuli amount to the activation of precompiled schemata or the compilation and activation of pre-defined schemata respectively, while problem solving involves more complex cognitive processes. The application of this model to the MP condition suggests the tentative conclusion that subjects used both sources of information optimally within an overall opportunistic strategy favouring visual search over more complex matching processes, such matching processes being activated only when visual information was insufficient or ambiguous. However, although Rasmussen's model constitutes an appropriate framework for predicting subjects' performances, it is too general to provide any meaningful insight into the strategies underlying these performances and their implementation, that is, how multimodal stimuli are processed, and how the results of modality-specific processes are integrated and unified to produce meaningful, coherent interpretations and appropriate actions or motor responses. Subjects' performances are indeed compatible with cognitive models of multimodal input processing that postulate the existence of high-level interactions between unimodal perceptual and interpretative processes rather than
early low-level interactions; see, for instance, the model proposed in [Engelkamp, 1992]. However, further research is needed in order to determine which type of interaction (competition, synergism or complementarity) would best account for subjects' results within the framework of our experimental protocol. For instance, progressively reducing the duration of visual target presentations in the MP condition appears to be a promising, though difficult to implement, experimental paradigm for increasing our knowledge of the cognitive processes involved in the processing of multimodal inputs [Massaro, 2002].

Future work. To achieve significant advances in the investigation of multimodal perception and interpretation processes, experimental approaches have to overcome major difficulties. In particular, homogeneous sets of visual search tasks are needed in order to make it possible to achieve meaningful, reliable intra- and inter-subject comparisons, as the same scene cannot be processed in the three conditions by the same subject. The "difficulty" of the visual search tasks proposed to our subjects varied greatly from one scene to another. For instance, in the VP condition, all subjects (6) failed to select the correct target for one scene, and only 8 scene+target pairs occasioned 26 out of the 31 errors observed in this condition. We are currently refining our visual and semiotic characterizations of scenes and targets in order to be capable of defining and generating sets of really equivalent scenes. The number of participants and scenes should also be increased considerably, so that a greater number of more sophisticated, inter-related hypotheses can be tested simultaneously, and their soundness evaluated accurately, thanks to the possible application of appropriate statistical techniques.

Ergonomic recommendations. To conclude, the quantitative results presented in this section contribute to validating hypotheses A and B. However, further research is needed to determine whether the significantly longer selection times observed in the OP condition are mainly due either to the differences between the tasks subjects performed in the OP condition and those they carried out in the VP and MP conditions, or to the fact that subjects were unfamiliar with the tasks they carried out in the OP condition. These results also suggest useful recommendations for improving user interface design. In order to facilitate visual search tasks on crowded displays without resorting to standard visual enhancement techniques, two novel forms of user support may provide useful alternatives:
a. If accuracy only is sought, an oral message comprising an unambiguous verbal designation of the graphical target and spatial information on its location in the display will prove sufficient.
b. If both accuracy and rapidity are sought, a multimodal message will be more appropriate, that is, a message comprising a context-free visual
presentation of the target together with an oral message including the same information as mentioned in recommendation a. However, further experimental research is needed to confirm these recommendations beyond doubt, inasmuch as they have been inferred from a relatively small sample of experimental data and measurements. In addition, oral and multimodal messages should be compared, in terms of effectiveness and comfort, with other forms of user support, such as target visual enhancement through colour, animation, zooming, etc. Until a sufficient amount of experimental data has been collected, recommendations a. and b. should be considered tentative. Analyses of results per class of scenes and per type of message, i.e. per type of spatial information, are presented next. These analyses make it possible to refine and enrich our initial working hypotheses.
3.1.2 Detailed analysis.
Results per class of scenes. Subjects' results, grouped per scene class and target presentation mode, are presented in Table 7.3. Error percentages were computed over 96 samples per condition for class 1 (due to the exclusion of two class 1 scenes, as explained at the beginning of Section 3), and 108 samples per condition for class 2. No statistical analysis was performed on these results by reason of the rather small number of samples in each set. Multimodal messages proved to be most effective, in comparison with visual and oral presentations, especially for scenes representing symbolic or arbitrary objects. For scenes in class 1, comparisons between the three conditions indicate that errors were reduced in the MP condition by 12.6 percentage points (compared to the VP condition) and 5.3 percentage points (compared to the OP condition), while average selection times decreased by 0.24 and 1.27 seconds respectively. In short, concerning class 1 scenes, 86% of the selection errors observed in the VP condition did not occur in the MP condition, and selection times were one third longer in the OP condition than in the MP condition. As for realistic scenes, the average selection time in the MP condition is similar to the VP one (2.40 sec. versus 2.43 sec.), and markedly lower than the OP one (3.58 sec.), whereas the number of selection errors is similar to the OP one, and lower than the VP one (by 10.2 percentage points). These results confirm the main assumption proposed in the previous subsection for interpreting the global results, namely that, in the MP condition, subjects took advantage of both the visual and oral information available, especially for
Table 7.3. Results per target presentation mode and class of scenes.

Target presentation mode    Error percentage    Average selection time (sec.)    Standard deviation (sec.)
VP-C1                       14.6                3.27                             1.94
VP-C2                       15.7                2.43                             1.39
OP-C1                       7.3                 4.30                             4.09
OP-C2                       6.5                 3.58                             2.87
MP-C1                       2                   3.03                             2.36
MP-C2                       5.5                 2.40                             1.36
Percentages were computed over the total number of samples (target selection tasks) per condition.
processing class 1 scenes. For instance, we observed that some subjects lacked the a priori knowledge required for taking advantage of the implicit information conveyed by ISI messages; visual information can prove most helpful for achieving successful target identification in such cases. On the other hand, class 1 scenes, which consisted of collections of symbolic or arbitrary graphical objects (e.g. flags or geometric figures), favoured visual confusions between targets and similar objects in the displayed collections; verbal designations and spatial information undoubtedly helped subjects to resolve possible visual "ambiguities". The fact that average selection times were consistently longer for class 1 scenes than for class 2 scenes can be explained as follows. If the target is a familiar object (such as a pan) in a familiar realistic scene (a kitchen, for instance), visual exploration of the scene is facilitated by a priori knowledge of the standard structure of such environments and the likely locations of the target object therein. Such knowledge is not available in the case of unrealistic class 1 scenes; the structure of such a scene and the possible locations of the target in it cannot be foreseen using a priori knowledge, so that a more careful search, or even an exhaustive exploration, of the scene is necessary for succeeding in locating the target. This assumption may also explain why multimodal target presentations proved to be most effective for scenes belonging to class 1: both oral and visual information contributed to compensating for the lack of pragmatic a priori knowledge.
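For class 1 scenes, the figures quoted above can be recovered from Table 7.3 as follows (the error reductions are percentage-point differences, while the 86% figure is the relative reduction).

```python
# Worked check of the class 1 figures quoted above, from Table 7.3.
err_vp, err_op, err_mp = 14.6, 7.3, 2.0        # error percentages, class 1
time_vp, time_op, time_mp = 3.27, 4.30, 3.03   # average selection times (sec.), class 1

print(round(err_vp - err_mp, 1))                # 12.6 points fewer errors than VP
print(round(err_op - err_mp, 1))                # 5.3 points fewer errors than OP
print(round(100 * (err_vp - err_mp) / err_vp))  # 86 (% of VP errors avoided in MP)
print(round(time_vp - time_mp, 2), round(time_op - time_mp, 2))  # 0.24 and 1.27 seconds saved
```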
Results per type of message. Five categories of verbal messages were experimented with (cf. Subsection 2.3). Messages were classified according to the type of spatial information they comprised: absolute (ASI), relative (RSI), implicit (ISI), plus absolute-relative (ASI+RSI) and absolute-implicit (ASI+ISI).
Subjects' results, grouped according to these categories, are presented in Table 7.4. Subjects' performances in the VP condition are also reported, although messages were excluded from target presentations in this condition; they serve as references in the assessment of the effectiveness and efficiency of the user support provided by oral messages in the two other conditions (OP and MP). Percentages were computed, for each condition, over 48 samples (ASI), 72 samples (RSI), 24 samples (ISI), 36 samples (ASI+RSI), and 12 samples (ASI+ISI)10. For each category of messages, comparisons between results achieved by subjects in the VP, OP and MP situations suggest that absolute and/or relative spatial information improved selection accuracy markedly (cf. the ASI, RSI and ASI+RSI types of messages). However, the usefulness of ISI messages seems questionable, at least in the OP condition. Their effectiveness in the MP condition denotes the complexity of the processes at work in the interpretation of multimodal stimuli. Average selection times for scenes associated with RSI and ISI messages were much longer in the OP condition (4.03 and 6.12 seconds respectively) than in the other conditions (i.e. 2.86 and 1.84 for the VP condition, and 3.12 and 2.1 for the MP condition). For RSI messages, this effect may be due to the complexity of the visual search strategy induced by relative spatial information when the target is unknown visually. In such cases, the search strategy is likely to include two successive steps: first, localization of the reference object, then exploration of its vicinity in search of the target [Gramopadhye and Madhani, 2001]. Each step being longer than the single-step search for a visually known target, the resulting global selection time is necessarily much longer than in the other conditions where the target is visually known. This interpretation may also explain why RSI messages did not affect selection times in the MP condition noticeably. The target being in the vicinity of the reference object and having been viewed previously, it can be recognized through peripheral vision, so that one eye fixation only is required for locating both the reference object and the target, cf. [van Diepen et al., 1998]. However, it is also possible that, in the MP condition, subjects tended to adopt a simpler search strategy based exclusively on the available visual information (hence comprising a single visual search step) whenever the oral message induced a complex, slow selection strategy. This second interpretation has the advantage of explaining why both RSI and ISI messages exerted no perceptible influence on selection times in the MP condition.
¹⁰ That is, 192 samples instead of 6 × 34 = 204 samples (cf. the two scenes that were excluded). Two additional scenes (6 × 2 samples) were not taken into account because the corresponding messages did not include any spatial information, the visual saliency of the target making such information superfluous.
Table 7.4. Results per target presentation mode and type of verbal message.

Condition        Message type   Percentage of errors   Average selection time (sec.)   Standard deviation (sec.)
VP (reference)   ASI            10.4                   2.87                            1.19
                 RSI            26.4                   2.86                            2.14
                 ISI            4.17                   1.84                            0.57
                 ASI+RSI        13.89                  3.57                            1.99
                 ASI+ISI        8.33                   3.54                            0.98
OP               ASI            0                      2.91                            3.41
                 RSI            8.33                   4.03                            5.94
                 ISI            16.67                  6.12                            3.78
                 ASI+RSI        5.56                   3.82                            3.78
                 ASI+ISI        16.67                  5.19                            3.37
MP               ASI            4.17                   2.42                            1.41
                 RSI            8.33                   3.12                            2.43
                 ISI            0                      2.1                             1.06
                 ASI+RSI        0                      2.82                            1.84
                 ASI+ISI        0                      2.98                            2.53

Best results in the OP condition are in bold type, and lowest ones are underscored. Percentages were computed over the total number of samples per message type. (Results in the VP condition are reported, VP being used here as the reference condition.)
As for ISI messages, their exploitation involves complex cognitive processes which may have slowed down target selection in the OP condition. The poor results observed in the OP condition might also suggest that the usefulness of ISI messages is intrinsically limited.

To sum up, whereas the inclusion of any category of verbal spatial information in multimodal target presentations seems worthwhile, absolute spatial information should be preferred over other information types in the design of spatial information messages, in order to effectively facilitate visual target localization and improve both selection times and accuracy. However, these observations, which refine hypothesis C (cf. Subsection 2.1), should be viewed as working hypotheses rather than reliable, or even tentative, conclusions. Their validity has to be assessed through careful, systematic, large-scale experimentation. Analysis of eye movements (by means of an eye-tracker) would provide invaluable information on visual search strategies, especially
on the exact influence of the information content of oral messages on these strategies.
3.2 Qualitative Analyses: Selection Errors
Qualitative analyses focused exclusively on the subjects’ errors, with two aims: gaining a better understanding of the contribution of verbal messages to assisting users in visual search tasks, and obtaining useful knowledge for improving message design. The analyses use the detailed characterizations of scenes and messages listed in Subsection 2.6, as well as the subjects’ subjective ratings of the difficulty of the prescribed visual search tasks (cf. the post-session questionnaires mentioned in Subsection 2.5).

Scenes were filtered so that, in each condition, only the scenes which had occasioned more than one error were considered, on the basis of the following assumption: for a given scene in a given condition, the failure of one single subject is more likely to originate from the subject’s capabilities than from the scene characteristics or the message information content.
3.2.1 Visual condition. The main plausible factors behind the selection errors observed in the VP condition are presented next. Each percentage is the number of errors which a given characteristic of the scene may explain, by itself or in conjunction with other factors, computed over the total number of filtered errors (i.e. 26). Factors are listed in decreasing order of the percentage of errors they help to account for:

Concerning targets: lack of salience (85%), eccentric position in the scene (69%), possible confusion with other objects (69%), unfamiliarity (50%).

Concerning scenes: crowded (69%), unstructured (46%), featuring geometric forms (42%).

This analysis of subjects’ errors in the VP condition will be used as a reference in the next subsection, which focuses on errors in the OP and MP conditions.
3.2.2 Oral and multimodal conditions. Five scenes in the OP condition and only two scenes in the MP condition occasioned more than one error, against eight in the VP condition. In addition, 24 errors in the VP condition were “corrected” in the OP condition, so that seven out of the eight scenes occasioning more than one error in the VP condition yielded error-free results in the OP condition. These comparisons bring out the usefulness of oral messages for improving target selection accuracy.

However, four scenes yielding error-free results in the VP and MP conditions occasioned ten of the twelve¹¹ filtered errors observed in the OP condition¹². It is therefore likely that the main factor behind these errors is the poor quality of the information content of the oral messages paired with these scenes. The analysis of these four messages, together with the information provided by questionnaires and debriefings, supports this conclusion. Four errors were caused by an ISI message which referred to knowledge unfamiliar to the majority of subjects. An overly complex ASI+ISI message (in structure and length), referring to knowledge some subjects were unfamiliar with, may account for two other errors. The two remaining pairs of errors may be reliably ascribed to the use, in both verbal target designations, of technical nouns whose exact meanings were unfamiliar to some subjects.

The fact that none of these errors occurred in the MP condition, together with the fact that only two scenes occasioned the four¹³ filtered errors observed in this condition, illustrates the advantages of combining visual and verbal information in target presentations. In the MP condition, two errors occurred on a “difficult” scene which occasioned six errors in the VP condition (a crowded scene, with a non-salient, unfamiliar target easy to confuse with other objects) and two errors in the OP condition (use of technical vocabulary). The other two errors were occasioned by a scene which was processed successfully by all subjects in the OP condition, but occasioned three errors in the VP condition. This may suggest that the processing of multimodal incoming information is controlled or influenced by visual perception strategies rather than by high-level cognitive processes.

In short, the qualitative analysis of errors confirms the usefulness of oral messages for improving the accuracy of visual target identification, provided that messages are short, their syntactic structure straightforward, and their vocabulary familiar to users, and, above all, that their information content is appropriate.

¹¹ The total number of errors in the OP condition amounted to 14.
¹² Error patterns for these images were as follows: 4, 2, 2, 2.
¹³ Out of 8.
Table 7.5. Subjects’ judgments on task difficulty and preferences.

Condition   “Very easy”   “Easy”   “Difficult”   “Very difficult”
VP          22%           28%      39%           11%
OP          22%           61%      17%           0%
MP          72%           17%      11%           0%

Top-ranking preference:   VP 17%   OP 17%   MP 66%

Percentages were computed over the total number of subjects. Highest values in each line are in bold type.
3.2.3 Subjects’ subjective judgements. Subjects expressed positive judgments on the contribution of oral messages to facilitating visual search tasks in the post-experiment questionnaires. The MP condition achieved the highest rate of subjective satisfaction, as shown in Table 7.5: the majority of subjects (72%) rated the execution of visual search tasks in the MP condition as “very easy”, whereas only 22% applied this rating to the OP and VP conditions. The VP condition received the lowest ratings. In addition, the MP condition, compared to the VP and OP conditions, was judged most efficient in terms of rapidity by fourteen subjects. Thus, most subjects implicitly considered that oral messages had helped them to achieve the prescribed search tasks, especially when verbal information was associated with a visual presentation of the target. Finally, the majority of subjects (66%) preferred the MP condition to the others.

Nevertheless, three subjects complained about the content or wording of some oral messages. The criticisms concerned, for one subject, messages without spatial information (two messages); for the second, the use of a somewhat technical word outside his vocabulary (one message); and for the last, the length and complexity of RSI messages. In addition, a fourth subject considered that oral messages were not useful in the MP condition.

To conclude, the MP condition came first in the subjective judgments of most subjects, as regards both the utility and the usability of oral messages. These results are encouraging in view of the numerous potential interactive applications involving visual search tasks. However, voluntary participants in experimental evaluations of novel artefacts or techniques are prone to judge them positively (experimental bias), especially if the evaluation consists of a single session, as ours did. Their judgments may evolve under the influence of experience. Further usability studies are needed, in particular to assess how future users will appraise the support provided by oral messages after extensive practice in real contexts of use.
4. Conclusion
A preliminary experimental study has been presented, which aims at assessing the contribution of oral messages to facilitating visual search tasks on crowded visual displays. Results of quantitative and qualitative analyses suggest that appropriate verbal messages can improve both search accuracy and selection times. In particular, multimodal messages including absolute spatial oral information on the target location in the visual scene, together with a visual presentation of the isolated target, are most effective. This type of message also received the highest subjective satisfaction ratings from most subjects. Subjects’ acceptance, even in an experimental environment, is a valuable asset. Numerous potential applications exist. Facilitating visual search could indeed improve the efficiency and usability of present human-computer interaction appreciably, as direct manipulation of GUIs is the prevalent user interface design paradigm for the moment.

However, these results are only tentative, owing to the relatively small number of subjects involved in the experiment (18), the limited number of scenes they had to process (12 per condition), and the coarseness of the measures used, which were restricted to search accuracy and selection times (including search, identification, and selection of the target). In addition, qualitative analyses (focused on subjects’ errors) suggest the possible influence, on subjects’ results, of factors that were not systematically taken into account in the design of our experimental protocol.

Our current short-term research directions are based on these observations. We are currently planning a series of experimental studies focused on in-depth investigation of the possible influence, on search accuracy and selection times, of the visual characteristics of scenes and targets, namely scene structure or target position. In particular, scene structure seems capable of significantly facilitating the exploration of class 1 scenes (see [Cribbin and Chen, 2001a]), whereas the efficiency of relative spatial information can be influenced by the proximity of the target to the salient object [Gramopadhye and Madhani, 2001]. Each of these experiments will be designed along the same lines as the one presented here, and implemented using a similar experimental protocol. However, each will address a few specific related issues and involve a larger number of participants, in order to make it possible to refine and enrich the results of the initial study presented here.

In addition, to achieve sound, meaningful interpretations of future quantitative experimental results, we shall compare them with qualitative eye-tracking data obtained from a carefully selected sample of participants. We believe that a better understanding of visual search strategies, of their inter-individual diversity [Cribbin and Chen, 2001b], their sensitivity to the visual characteristics
of the scenes displayed, and their evolution under the influence of oral support, will prove useful for improving the design of oral user support in visual search tasks. In particular, such knowledge could help designers to tailor the information content and wording of oral messages to individual search strategies and visual scene characteristics. A further step may be to compare oral support for visual search with various visual enhancements of targets, in terms of effectiveness and usability.
References

Ahlberg, C., Williamson, C., and Shneiderman, B. (1992). Dynamic Queries for Information Exploration: An Implementation and Evaluation. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 619–626, New York. ACM Press.
André, E. (1997). WIP and PPP: A Comparison of two Multimedia Presentation Systems in Terms of the Standard Reference Model. Computer Standards and Interfaces, 18(6-7):555–564.
André, E. and Rist, T. (1993). The Design of Illustrated Documents as a Planning Task. In Maybury, M. T., editor, Intelligent Multimedia Interfaces, pages 96–116. Menlo Park, CA: AAAI/MIT Press.
Baber, C. (2001). Computing in a Multimodal World. In Proceedings of the First International Conference on Universal Access in Human-Computer Interaction (UAHCI), pages 232–236. Mahwah, NJ: Lawrence Erlbaum.
Bernsen, N. O. (1994). Foundations of Multimodal Representations, a Taxonomy of Representational Modalities. Interacting with Computers, 6:347–371.
Bieger, G. R. and Glock, M. D. (1986). Comprehending Spatial and Contextual Information in Picture-Text Instructions. Journal of Experimental Education, 54:181–188.
Burhans, D. T., Chopra, R., and Srihari, R. K. (1995). Domain Specific Understanding of Spatial Expressions. In Proceedings of the IJCAI Workshop on the Representation and Processing of Spatial Expressions, pages 33–40, Montréal, Canada.
Card, S. K. and Mackinlay, J. (1997). The Structure of the Information Visualization Design Space. In Proceedings of IEEE Symposium on Information Visualization (InfoVis), pages 92–99, Phoenix, AZ. IEEE Computer Society Press.
Card, S. K., Mackinlay, J., and Shneiderman, B. (1999). Readings in Information Visualization – Using Vision to Think. San Francisco (CA): Morgan Kaufmann Publishers.
Chelazzi, L. (1999). Serial Attention Mechanisms in Visual Search: A Critical Look at the Evidence. Psychological Research, 62(2-3):195–219.
Coutaz, J. and Caelen, J. (1991). A Taxonomy for Multimedia and Multimodal User Interfaces. In Proceedings of the First ERCIM Workshop on Multimodal HCI, pages 143–148, INESC, Lisbon.
Coutaz, J., Nigay, L., Salber, D., Blandford, A., May, J., and Young, R. (1995). Four Easy Pieces for Assessing the Usability of Multimodal Interaction: The CARE Properties. In Proceedings of INTERACT, pages 115–120, Lillehammer, Norway.
Cribbin, T. and Chen, C. (2001a). A Study of Navigation Strategies in Spatial-Semantic Visualizations. In Proceedings of the Ninth International Conference on Human-Computer Interaction (HCI International), pages 948–952. Mahwah, NJ: Lawrence Erlbaum.
Cribbin, T. and Chen, C. (2001b). Exploring Cognitive Issues in Visual Information Retrieval. In Proceedings of INTERACT, pages 166–173, Tokyo, Japan. Amsterdam: IOS Press.
de Vries, G. and Johnson, G. I. (1997). Spoken Help for a Car Stereo: An Exploratory Study. Behaviour and Information Technology, 16(2):79–87.
Engelkamp, J. (1992). Modality and Modularity of the Mind. In Actes du 5ème Colloque de l’ARC ’Percevoir, Raisonner, Agir – Articulation de Modèles Cognitifs’, pages 321–343, Nancy.
Faraday, P. and Sutcliffe, A. (1997). Designing Effective Multimedia Presentations. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 272–278. New York: ACM Press & Addison Wesley.
Fekete, J.-D. and Plaisant, C. (2002). Interactive Information Visualization of a Million Items. In Proceedings of IEEE Symposium on Information Visualization (InfoVis), pages 117–124. Boston, MA: IEEE Press.
Findlay, J. M. and Gilchrist, D. (1998). Eye Guidance and Visual Search. In Underwood, G., editor, Eye Guidance in Reading and Scene Perception, chapter 13, pages 295–312. Amsterdam: Elsevier.
Frank, A. (1998). Formal Models for Cognition - Taxonomy of Spatial Location Description and Frames of Reference. In Freksa, C., Habel, C., and Wender, K. F., editors, Spatial Cognition - An Interdisciplinary Approach to Representing and Processing Spatial Knowledge, volume 1, pages 293–312. Berlin: Springer Verlag.
Grabowski, N. A. and Barner, K. E. (1998). Data Visualisation Methods for the Blind Using Force Feedback and Sonification. In Proceedings of the SPIE Conference on Telemanipulator and Telepresence Technologies, volume 3524, pages 131–139. Bellingham, WA: International Society for Optical Engineering (SPIE).
Gramopadhye, A. K. and Madhani, K. (2001). Visual Search and Visual Lobe Size. In Arcelli, C., editor, Proceedings of Fourth International Conference on Visual Form, Lecture Notes in Computer Science, volume 2059, pages 525–531. Berlin Heidelberg: Springer Verlag.
Henderson, J. M. and Hollingworth, A. (1998). Eye Movements during Scene Viewing: an Overview. In Underwood, G., editor, Eye Guidance in Reading and Scene Perception, chapter 12, pages 269–293. Amsterdam: Elsevier.
Lamping, J., Rao, R., and Pirolli, P. (1995). A Focus + Context Technique Based on Hyperbolic Geometry for Visualizing Large Hierarchies. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 401–408. New York: ACM Press & Addison Wesley.
Massaro, D. (2002). Personal communication.
Maybury, M. T. (2001). Universal Multimedia Information Access. In Proceedings of the First International Conference on Universal Access in Human-Computer Interaction (UAHCI), pages 382–386. Mahwah, NJ: Lawrence Erlbaum.
Maybury, M. T. (1993). Planning Multimedia Explanations Using Communicative Acts. In Maybury, M. T., editor, Intelligent Multimedia Interfaces, pages 59–74. Menlo Park, CA: AAAI/MIT Press.
Nigay, L. and Coutaz, J. (1993). A Design Space for Multimodal Systems: Concurrent Processing and Data Fusion. In Proceedings of INTERCHI, pages 172–178. New York: ACM Press & Addison Wesley.
Oviatt, S., DeAngeli, A., and Kuhn, K. (1997). Integration and Synchronisation of Input Modes during Multimodal Human-Computer Interaction. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 415–422. New York: ACM Press & Addison Wesley.
Rasmussen, J. (1986). Information Processing and Human-Machine Interaction: An Approach to Cognitive Engineering. Amsterdam: North-Holland.
Robbe, S., Carbonell, N., and Dauchy, P. (2000). Expression Constraints in Multimodal Human-Computer Interaction. In Proceedings of the International Conference on Intelligent User Interfaces (IUI), pages 225–229. New York: ACM Press.
Shneiderman, B. (1983). Direct Manipulation: A Step Beyond Programming Languages. IEEE Computer, 16:57–69.
Shneiderman, B. (1996). The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proceedings of IEEE Workshop on Visual Languages (VL), pages 336–343, Boulder, CO.
Sidner, C. L. and Dzikovska, M. (2002). Hosting Activities: Experience with and Future Directions for a Robot Agent Host. In Proceedings of the International Conference on Intelligent User Interfaces (IUI), pages 143–150. New York: ACM Press.
Stock, O., Strappavara, C., and Zancanaro, M. (1997). Explorations in an Environment for Natural Language Multimodal Information Access. In Maybury, M. T., editor, Intelligent Multimedia Information Retrieval, pages 381–398. Menlo Park, CA: AAAI/MIT Press.
van Diepen, M. J., Wampers, M., and d’Ydewall, G. (1998). Functional Division of the Visual Field: Moving Masks and Moving Windows. In Underwood, G., editor, Eye Guidance in Reading and Scene Perception, chapter 15, pages 337–355. Amsterdam: Elsevier.
van Mulken, S., André, E., and Muller, J. (1999). An Empirical Study on the Trustworthiness of Life-Like Interface Agents. In Proceedings of the Eighth International Conference on Human-Computer Interaction (HCI International), pages 152–156. Mahwah, NJ: Lawrence Erlbaum.
Yu, W., Ramloll, R., and Brewster, S. A. (2001). Haptic Graphs for Blind Computer Users. Haptic Human-Computer Interaction, 2058:41–51.
Chapter 8

GEOMETRIC AND STATISTICAL APPROACHES TO AUDIOVISUAL SEGMENTATION

Trevor Darrell, John W. Fisher III, Kevin W. Wilson, and Michael R. Siracusa
Computer Science and Artificial Intelligence Laboratory, M.I.T., Cambridge, MA 02139, USA
{trevor, fisher, kwilson, siracusa}@csail.mit.edu

Abstract: Multimodal approaches are proposed for segmenting multiple speakers using geometric or statistical techniques. When multiple microphones and cameras are available, 3-D audiovisual tracking is used for source segmentation and array processing. With just a single camera and microphone, an information theoretic criterion separates speakers in a video sequence and associates relevant portions of the audio signal. Results are shown for each approach, and an initial integration effort is discussed.

Keywords: Source separation, vision tracking, microphone array, mutual information.
1. Introduction
Conversational dialog systems have become practically useful in many application domains, including travel reservations, traffic information, and database access. However, most existing conversational speech systems require tethered interaction, and work primarily for a single user. Users must wear an attached microphone or speak into a telephone handset, and must do so one at a time. This limits the range of use of dialog systems, since in many applications users might expect to freely approach and interact with a device. Worse, they may wish to arrive as a group, and talk among themselves while interacting with the system. To date it has been difficult for speech recognition systems to handle such conditions and correctly recognize the utterances intended for the device. Given only a single sensing modality, and perhaps only a single sensor, disambiguating the audio from multiple speakers can be a challenge. But with
multiple modalities, and possibly multiple sets of sensors, segmentation can become feasible. In this chapter we present two methods for audiovisual segmentation of multiple speakers, based on geometric and statistical source separation approaches.

We have explored two configurations which are of practical interest. The first is based on a “smart environment” or “smart room” equipped with multiple stereo cameras and a ceiling-mounted large-aperture microphone array grid. In this configuration users can move arbitrarily through the room or environment while focused audiovisual streams are generated from their appearance and utterances. In the second configuration we presume that a single omnidirectional microphone and a single camera are available. This is akin to what one might find on a PDA or cell phone, or in a low-cost PC videoconferencing installation.

In a multisensor environment, we take a geometric approach, combining multiview image correspondence and tracking methods with acoustic beamforming techniques. A multimodal approach can track sources even in acoustically reverberant environments with dynamic illumination, conditions that are challenging for audio or video processing alone. When only a single multimodal sensor pair (audio and video) is available, we use a statistical approach, jointly modelling audio and video variation to identify cross-modal correspondences. We show how this approach can detect which user is speaking when several are facing a device. This allows the segregation of users’ utterances from each other’s speech, and from background noise events.

We first review related work, and then present our method for geometric source separation and vision-guided microphone array processing. We then describe our single-camera/microphone method for audiovisual correspondence using joint statistical processing. We show results with each of these techniques, and describe our initial integration effort.
2. Related Work
Humans routinely perform tasks in which ambiguous auditory and visual data are combined in order to support accurate perception. In contrast, automated approaches for statistical processing of multimodal data sources lag far behind. This is primarily due to the fact that few methods adequately model the complexity of the audio/visual relationship. Classical approaches to multimodal fusion at a signal processing level often either assume a statistical relationship which is too simple (e.g. jointly Gaussian) or defer fusion to the decision level when many of the joint (and useful) properties have been lost. While such pragmatic choices may lead to simple statistical measures, they do so at the cost of modelling capacity.
An information theoretic approach motivates fusion at the measurement level without regard to specific parametric densities. The idea of using information theoretic principles in an adaptive framework is not new (e.g. see [Deco and Obradovic, 1996] for an overview) with many approaches suggested over the last 30 years. A critical distinction in most information theoretic approaches lies in how densities are modelled (either explicitly or implicitly), how entropy (and by extension mutual information) is approximated or estimated, and the types of mappings which are used (e.g. linear vs. nonlinear). Approaches which use a Gaussian assumption include [Plumbley and Fallside, 1988; Plumbley, 1991] and [Becker, 1992]. Additionally, Becker [1992] applies the method to fusion of artificially generated random dot stereograms.

There has been substantial progress on feature-level integration of speech and vision. For example, Meier et al. [1999], Wolff et al. [1994] and others have built visual speech reading systems that can improve speech recognition results dramatically. Our system, described below, is designed to be able to detect and disambiguate cases where audio and video signals are coming from different sources. Other audio/visual work which is closely related to ours is that of Hershey and Movellan [1999], which examined the per-pixel correlation relative to an audio track, detecting which pixels have related variation. Again, an inherent assumption of this method was that the joint statistics were Gaussian. Slaney and Covell [2000] looked at optimizing temporal alignment between audio and video tracks using canonical correlations (equivalent to mutual information in the jointly Gaussian case), but did not address the problem of detecting whether two signals came from the same person.

Several authors have explored geometric approaches to audiovisual segmentation using array processing techniques. Microphone arrays are a special case of the more general problem of sensor arrays, which have been studied extensively in the context of applications such as radar and sonar [van Veen and Buckley, 1988]. The Huge Microphone Array project [Silverman et al., 1998] is investigating the use of very large arrays containing hundreds of microphones. Their work concentrates on audio-only solutions to array processing. Another related project is Wang and Brandstein’s audio-guided active camera [Wang and Brandstein, 1999], which uses audio localization to steer a camera on a pan/tilt base. A number of projects [Bub et al., 1995; Casey et al., 1995; Collobert et al., 1996] have used vision to steer a microphone array, but because they use a single camera to steer a far-field array, they cannot obtain or make use of full 3-D position information; they can only select sound coming from a certain direction.

We are exploring both a geometric and statistical approach to audiovisual segmentation. In the next section, we describe our geometric approach based on microphone and camera arrays. Following that, we present our statistical
approach using an information theoretic measure to relate single channel audio and video signals.
3. Multimodal Multisensor Domain
The association between sound and location makes a microphone array a powerful tool for audiovisual segmentation. In combination with additional sensors and contextual information from the environment, a microphone array can effectively amplify and separate sounds of interest from complex background noise. To focus a microphone array, the location of the speaker(s) of interest must be known. A number of techniques exist for localizing sound sources using only acoustic cues [Viberg and Krim, 1997], but the performance of these localization techniques tends to degrade significantly in the presence of reverberation and/or multiple sound sources. Unfortunately, most common office and meeting room environments are highly reverberant, with reflective wall and table surfaces, and will normally contain multiple speakers. However, in a multimodal setting we can take advantage of other sensors in the environment to perform localization of multiple speakers despite reverberation. We use a set of cameras to track the position of speakers in the environment, and report the relative geometry of speakers, cameras, and microphones. The vision modality is not affected by acoustic reverberation, but its accuracy will depend on the calibration and segmentation procedures. In practice we use video information to restrict the range of possible acoustic source locations to a region small enough to allow for acoustic localization techniques to operate without severe problems with reverberation and multiple speakers.
3.1 Microphone Array Processing Overview
Many problems can be addressed through array processing. The two array processing problems that are relevant to our system are beamforming and source localization.

Beamforming is a type of spatial filtering in which the signals from individual array elements are filtered and added together to produce an output that amplifies signals coming from selected regions of space and attenuates sounds from other regions of space. In the simplest form of beamforming, delay-and-sum beamforming, each channel’s filter is a pure delay. The delay for each channel is chosen such that signals from a chosen “target location” are aligned in the array output. Signals from other locations will tend to be combined incoherently.

Source localization is a complementary problem to beamforming whose goal is to estimate the location of a signal source. One way to do this is to beamform to all candidate locations and to pick the location that yields the strongest response. This method works well, but the amount of computation required to do a full search of a room is prohibitively large. Another method for source localization consists of estimating relative delays among channels and using these delays to calculate the location of the source. Delay-estimation techniques are computationally efficient but tend to perform poorly in the presence of multiple sources and/or reverberation.

Figure 8.1. Array power response as a function of position (two speakers). This plot shows the array output power as the array’s focus is scanned through a plane centred on one speaker while another speaker is nearby. The central speaker is easily discernible in the plot, but the peak corresponding to the weaker speaker is difficult to distinguish among the sidelobe peaks. Using vision-based person tracking cues can disambiguate this case.

For microphone arrays that are small in size compared to the distance to the sources of interest, incoming wavefronts are approximately planar. Because of this, only source direction can be determined; source distance remains ambiguous. When the array is large compared to the source distance, the sphericity of the incoming wavefronts is detectable, and both direction and distance can be determined. These effects of array size apply both to localization and to beamforming, so if sources at different distances in the same direction must be separated, a large array must be used. As a result, with large arrays the signal-to-noise ratio (for a given source) at different sensors will vary with source location. Because of this, signals with better signal-to-noise ratios should be weighted more heavily in the output of the array. Our formulation of the steering algorithm presented in Section 8.3.3 takes this into account.

Figure 8.2. The test environment. This figure shows a schematic view of the environment with stereo cameras represented by black triangles and microphones represented by empty circles. The single microphone used for comparison in the experiments is shown filled in.
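As a concrete illustration of the delay-and-sum and scan-and-pick ideas above, the following minimal sketch (not the chapter's implementation; the array geometry, sampling rate, and signals are invented for the example, and channels are weighted equally) time-aligns each microphone channel to a candidate 3-D position and scans a grid of candidates for the maximum output power. The SNR-dependent channel weighting discussed later in this section refines the equal-weight sum used here.

import numpy as np

def delay_and_sum(signals, mic_positions, source_pos, fs, c=343.0):
    """Delay-and-sum beamformer focused on a candidate 3-D source position.

    signals:       (n_mics, n_samples) array of microphone signals
    mic_positions: (n_mics, 3) microphone coordinates in metres
    source_pos:    (3,) candidate source location in metres
    fs:            sampling rate in Hz; c: speed of sound in m/s
    """
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    # Integer sample delays that align the candidate location across channels.
    delays = np.round((dists - dists.min()) / c * fs).astype(int)
    n = signals.shape[1] - delays.max()
    aligned = np.stack([sig[d:d + n] for sig, d in zip(signals, delays)])
    return aligned.mean(axis=0)

def beam_power(signals, mic_positions, source_pos, fs):
    out = delay_and_sum(signals, mic_positions, source_pos, fs)
    return float(np.mean(out ** 2))

# Toy usage: a 4-microphone ceiling array and a brute-force grid search.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000
    mics = np.array([[0, 0, 2.5], [4, 0, 2.5], [0, 4, 2.5], [4, 4, 2.5]], float)
    true_pos = np.array([1.0, 2.0, 1.5])
    src = rng.standard_normal(fs)                    # broadband toy source, 1 s
    dists = np.linalg.norm(mics - true_pos, axis=1)
    sigs = np.stack([np.roll(src, int(round(d / 343.0 * fs))) / d
                     + 0.05 * rng.standard_normal(fs) for d in dists])
    candidates = [np.array([x, y, 1.5]) for x in np.arange(0.5, 4.0, 0.5)
                  for y in np.arange(0.5, 4.0, 0.5)]
    best = max(candidates, key=lambda p: beam_power(sigs, mics, p, fs))
    print("estimated source position:", best)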
3.2 Person Tracking Overview
Tracking people in known environments has recently become an active area of research in computer vision. Several person-tracking systems have been developed to detect the number of people present as well as their 3D position over time. These systems use a combination of foreground/background classification, clustering of novel points, and trajectory estimation over time in one or more camera views [Darrell et al., 2000; Krumm et al., 2000].

Colour-based approaches to background modelling have difficulty with illumination variation due to changing lighting and/or video projection. To overcome this problem, several researchers have supported the use of background models based on stereo range data [Darrell et al., 2000; Ivanov et al., 2000]. Unfortunately, most of these systems are based on computationally intense, exhaustive stereo disparity search. We have developed a system that can perform dense, fast, range-based tracking with modest computational complexity. We apply ordered disparity search techniques to prune most of the disparity search computation during foreground detection and disparity estimation, yielding a fast, illumination-insensitive 3D tracking system. Details of our system are presented in [Darrell et al., 2001], and work on integrating audio more directly into the tracking is presented in [Checka et al., 2003]. Our system reports the 3-D position of people moving about an environment equipped with an array of stereo cameras.
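A heavily simplified sketch of range-based foreground detection is given below (illustrative only; the authors' system uses ordered disparity search to prune computation rather than this brute-force per-pixel comparison, and all data here is synthetic): a per-pixel background disparity is learned from frames of the empty scene, and pixels markedly nearer than the background are labelled foreground, which is insensitive to pure illumination changes because disparity, unlike colour, does not vary with lighting.

import numpy as np

def background_disparity(empty_frames):
    """Per-pixel background model: median disparity over frames of the empty scene."""
    return np.median(np.stack(empty_frames), axis=0)

def foreground_mask(disparity, background, tol=2.0, min_valid=0.5):
    """Mark pixels whose disparity exceeds the background by more than `tol`.

    Invalid disparities (below `min_valid`) are ignored.
    """
    valid = (disparity > min_valid) & (background > min_valid)
    return valid & (disparity - background > tol)

# Toy usage with synthetic disparity maps (a near blob in front of a far wall).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    h, w = 120, 160
    wall = 8.0 + 0.3 * rng.standard_normal((h, w))           # far background
    empty = [wall + 0.1 * rng.standard_normal((h, w)) for _ in range(10)]
    frame = empty[0].copy()
    frame[30:100, 60:100] = 20.0                              # nearer object => larger disparity
    mask = foreground_mask(frame, background_disparity(empty))
    print("foreground pixels:", int(mask.sum()))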
3.3 Vision-Guided Acoustic Volume Selection
We perform both audio localization and beamforming with a large, ceiling-mounted microphone array. Localization uses information from both audio and video, while beamforming uses only the audio data and the results of the localization processing. A large array allows us to select a volume of 3-D space, rather than simply form a 2-D beam of enhanced response as anticipated by the standard array localization algorithms. However, the usual assumption of constant target signal-to-noise ratio (SNR) across the array does not hold when the array geometry is large (when the array width is on the same scale as the target distance).

Our system uses the location estimate from the vision tracker as the initial guess from which to begin a gradient ascent search for a local maximum in beam output power. In our system, beam power is defined as the integral over a half-second window of the square of the output amplitude. The vision tracker is accurate to within less than one meter. Gradient ascent to the nearest local maximum can therefore be expected to converge to the location of the speaker of interest when no other speakers are very close by.

For small microphone arrays, the relative SNRs of the individual channels do not vary significantly as a function of source location. This is, however, not true for larger microphone arrays. For our array, which is roughly 4 meters across, we must take into account the fact that some elements will have better signals than others. Specifically, if we assume that we have signals x_1 and x_2 which are versions of the unit-variance desired signal s, contaminated by unit-variance uncorrelated noise,

x_1 = a_1 s + n_1,    x_2 = a_2 s + n_2,

then the signal-to-noise ratios of x_1 and x_2 will be a_1^2 and a_2^2, respectively. Their optimal linear combination will be of the form y = b x_1 + x_2. Because of the uncorrelated noise assumption, the SNR of this combination will be

SNR(y) = \frac{(b a_1 + a_2)^2}{b^2 + 1}.

By taking the derivative of this expression with respect to b and setting the result equal to zero, one finds that the optimal value of b is

b = \frac{a_1}{a_2} = \sqrt{\frac{SNR(x_1)}{SNR(x_2)}}.

Individual elements’ signals should therefore be scaled by a constant proportional to the square root of their SNRs. We use the location estimate to weight individual channels assuming a 1/r attenuation due to the spherical spreading of the source: a_n = 1/r_n.

Table 8.1. Audio-video localization performance.

                                  SNR (dB)
Distant microphone                 −7.7
Video only                         −6.1
Audio only (dominant speaker)      −4.2
Audio-Video                         0.1
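A small numerical sketch of the weighting rule derived above follows (purely illustrative; the distances and signals are invented): already time-aligned channels are scaled in proportion to the square root of their estimated SNRs, approximated here by the 1/r spherical-spreading model.

import numpy as np

def snr_weights_from_distances(distances):
    """Weights proportional to sqrt(SNR) under a 1/r attenuation model (a_n = 1/r_n)."""
    a = 1.0 / np.asarray(distances, dtype=float)
    return a / np.linalg.norm(a)

def weighted_sum(aligned_signals, distances):
    """Combine already time-aligned channels with SNR-proportional weights."""
    w = snr_weights_from_distances(distances)
    return np.tensordot(w, aligned_signals, axes=1)

if __name__ == "__main__":
    # Two-channel example mirroring the derivation: SNRs a1^2 and a2^2 give b = a1/a2.
    a1, a2 = 1.0 / 1.5, 1.0 / 3.0      # source 1.5 m and 3.0 m from the two microphones
    b = a1 / a2
    print("optimal relative weight b =", b,
          "= sqrt(SNR1/SNR2) =", np.sqrt(a1 ** 2 / a2 ** 2))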
4. Results
Our test environment, depicted in Figure 8.2, is a conference room equipped with 15 omnidirectional microphones spread across the ceiling and 2 stereo cameras on adjacent walls. The audio and video subsystems were calibrated independently, and for our experiments we performed a joint calibration by finding the least-squares best-fit alignment between the two coordinate systems.

Figure 8.1 is an example of what happens when multiple speakers are present in the room. Audio-only gradient ascent could easily find one of the undesirable local maxima. Because our vision-based tracker is accurate to within one meter, we can safely assume that we will find the correct local maximum even in the presence of interferers.

To validate our localization and source separation techniques, we ran an experiment in which two speakers spoke simultaneously while one of them moved through the room. We tracked the moving speaker with the stereo tracker and processed the corresponding audio stream using three different localization techniques. For each, we used a reference signal collected with a close-talking microphone to calculate a time-averaged SNR (Table 8.1).

For performance comparison, we use the signal from a single distant microphone near the centre of the room. This provides no spatial selectivity, but for our scenario it tends to receive the desired speech more strongly than the interfering speech. The SNR for the single microphone case is negative because of a combination of the interfering speaker and diffuse noise from the room’s ventilation system.
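The time-averaged SNR computation can be approximated along the following lines (our own simplified reading, not the authors' exact procedure; the frame length and data are invented): the close-talking reference is least-squares projected out of the processed output frame by frame, and the ratio of the retained power to the residual power is averaged over the recording.

import numpy as np

def time_averaged_snr_db(output, reference, frame=8000):
    """Rough SNR estimate of `output` against a clean close-talking `reference`.

    Each frame of the reference is least-squares scaled to the output; the scaled
    reference is treated as 'signal' and the residual as 'noise + interference'.
    """
    ratios = []
    for start in range(0, min(len(output), len(reference)) - frame, frame):
        y = output[start:start + frame]
        s = reference[start:start + frame]
        gain = np.dot(y, s) / (np.dot(s, s) + 1e-12)   # projection onto the reference
        signal = gain * s
        noise = y - signal
        ratios.append(np.sum(signal ** 2) / (np.sum(noise ** 2) + 1e-12))
    return 10.0 * np.log10(np.mean(ratios))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    ref = rng.standard_normal(80000)
    out = 0.8 * ref + 0.4 * rng.standard_normal(80000)   # simulated beamformer output
    print("time-averaged SNR: %.1f dB" % time_averaged_snr_db(out, ref))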
Table 8.2. Percent correct word recognition rates for male(female) speakers. Performance was computed based on 20 sentence queries from five male and two female speakers. The close-talking microphone was clipped to the lapel of the speaker. The microphone array is as described above. The distant microphone is one array element from near the centre of the room.

Interferer                   None     -24 dB   -12 dB
Close-talking microphone     95(96)   95(96)   95(95)
Microphone array             82(43)   80(41)   64(13)
Distant microphone           73(35)   74(28)   51(7)
To evaluate the microphone array’s effects on recognition rates for automated speech recognition (ASR), we connected our system to the MIT Spoken Language Systems (SLS) Group’s JUPITER weather information system [Zue et al., 2000]. We had two speakers issue several weather-related queries from different locations in the room. As collected, the data contains quiet but audible noise from the ventilation system in the room. To evaluate the results under noisier conditions, additional noise was added to these signals. The results are shown in Table 8.2. As can be seen in the table, the beamformed signal from the microphone array was in all cases superior to the single distant microphone, but not as good as a close-talking microphone. The -12 dB interferer significantly degraded the performance of the array. We are currently working on adaptive null-steering algorithms that should improve performance in the presence of stronger interferers such as this.
5. Single Multimodal Sensor Domain
In the single (multimodal) sensor domain, geometry is less useful and array processing impossible; in this case we instead exploit audiovisual joint statistics to localize speakers. We adopt the paradigm of looking at a single camera view, and seeing what information from a single microphone can tell us about that view (and vice versa). We propose an independent cause model to capture the relationship between generated signals in each individual modality. Using principles from information theory and nonparametric statistics, we show how an approach for learning maximally informative joint subspaces can find cross-modal correspondences. We analyze the graphical model of multimodal generation and show under what conditions related subcomponents of each signal have high mutual information.

Nonparametric statistical density models can be used to measure the degree of mutual information in complex phenomena [Fisher and Principe, 1998]. We apply these models to audio/visual data. This technique simultaneously learns projections of images in the video sequence and projections of sequences of periodograms taken from the audio sequence. The projections are computed
adaptively such that the video and audio projections have maximum mutual information (MI). We first review the basic method for audio-visual fusion and information theoretic adaptive methods; see [Fisher and Darrell, 2002] for full details. We present our probabilistic model for cross-modal signal generation and show how audiovisual correspondences can be found by identifying components with maximal mutual information. In an experiment comparing the audio and video of every combination of a group of eight users, our technique was able to perfectly match the corresponding audio and video for each user. Finally, we show a new result on a monocular speaker segmentation task where we segment the audio between several speakers seen by the camera. These results are based purely on the instantaneous cross-modal mutual information between the projections of the two signals, and do not rely on any prior experience or model of users’ speech or appearance.
5.1 Probabilistic Models of Audio-Visual Fusion
We consider multimodal scenes which can be modelled probabilistically with one joint audiovisual source and distinct background interference sources for each modality. Each observation is a combination of information from the joint source and information from the background interferer for that channel. In contrast with the array processing case, we explicitly model visual appearance variation, not just 3D geometry. We use a graphical model (Figure 8.3) to represent this relationship. In the diagrams, B represents the joint source, while A and C represent single-modality background interference. Our purpose here is to analyze under which conditions our methodology should uncover the underlying cause of our observations.

Figure 8.3a shows an independent cause model for our typical case, where {A, B, C} are unobserved random variables representing the causes of our (high-dimensional) observations in each modality, {X^a, X^v}. In general there may be more causes and more measurements, but this simple case can be used to illustrate our algorithm. An important aspect is that the measurements have dependence on only one common cause. The joint statistical model consistent with the graph of Figure 8.3a is

P(A, B, C, X^a, X^v) = P(A) P(B) P(C) P(X^a | A, B) P(X^v | B, C).

Given the independent cause model, a simple application of Bayes’ rule (or the equivalent graphical manipulation) yields the graph of Figure 8.3b, which is consistent with

P(A, B, C, X^a, X^v) = P(X^a) P(C) P(A, B | X^a) P(X^v | B, C),

which shows that information about X^a contained in X^v is conveyed through the joint statistics of A and B. The consequence is that, in general, we cannot disambiguate the influences that A and B have on the measurements. A similar graph is obtained by conditioning on X^v.

Figure 8.3. Graphs illustrating the various statistical models exploited by the algorithm: (a) the independent cause model - X^a and X^v are independent of each other conditioned on {A, B, C}, (b) information about X^a contained in X^v is conveyed through joint statistics of A and B, (c) the graph implied by the existence of a separating function, and (d) two equivalent Markov chains which can be extracted from the graphs if the separating functions can be found.

Suppose decompositions of the measurements X^a and X^v exist such that the following joint density can be written:

P(A, B, C, X^a, X^v) = P(A) P(B) P(C) P(X^a_A | A) P(X^a_B | B) P(X^v_B | B) P(X^v_C | C)

where X^a = [X^a_A, X^a_B] and X^v = [X^v_B, X^v_C]. An example for our specific application would be segmenting the video image (or filtering the audio signal). In this case, we get the graph of Figure 8.3c, and from that graph we can extract the Markov chain which contains elements related only to B. Figure 8.3d shows equivalent graphs of the extracted Markov chain. As a consequence, there is no influence due to A or C. Of course, we are still left with the formidable task of finding a decomposition, but given the decomposition it can be shown, using the data processing inequality [Cover and Thomas, 1991], that the following inequalities hold:

I(X^a_B, X^v_B) \leq I(X^a_B, B)    (8.1)
I(X^a_B, X^v_B) \leq I(X^v_B, B)    (8.2)

More importantly, these inequalities hold for functions of X^a_B and X^v_B (e.g. Y^a = f(X^a; h_a) and Y^v = f(X^v; h_v)). Consequently, by maximizing the
mutual information I(Y^a; Y^v), we must necessarily increase the mutual information between Y^a and B and between Y^v and B. The implication is that fusion in such a manner discovers the underlying cause of the observations; that is, the joint density p(Y^a, Y^v) is strongly related to B. Furthermore, with an approximation, we can optimize this criterion without estimating the separating function directly. In the event that a perfect decomposition does not exist, it can be shown that the method will approach a “good” solution in the Kullback-Leibler sense.

From the perspective of information theory, estimating separate projections of the audio and video measurements which have high mutual information makes intuitive sense, as such features will be predictive of each other. The advantage is that the form of those statistics is not subject to the strong parametric assumptions (e.g. joint Gaussianity) which we wish to avoid. We can find these projections using a technique that maximizes the mutual information between the projections of the two spaces. Following [Fisher et al., 2000], we use a nonparametric model of the joint density for which an analytic gradient of the mutual information with respect to the projection parameters is available. In principle the method may be applied to any function of the measurements, Y = f(X; h), which is differentiable in the parameters h (e.g. as shown in [Fisher et al., 2000]). We consider a linear fusion model which results in a significant computational savings at a minimal cost to the representational power (largely due to the nonparametric density modelling of the output):

\begin{bmatrix} y^v_1 \cdots y^v_N \\ y^a_1 \cdots y^a_N \end{bmatrix} =
\begin{bmatrix} h_v^T & 0^T \\ 0^T & h_a^T \end{bmatrix}
\begin{bmatrix} x^v_1 \cdots x^v_N \\ x^a_1 \cdots x^a_N \end{bmatrix}    (8.3)

where x^v_i ∈ R^{N_v} and x^a_i ∈ R^{N_a} are lexicographic samples of images and periodograms, respectively, from an A/V sequence. The linear projection defined by h_v^T ∈ R^{M_v × N_v} and h_a^T ∈ R^{M_a × N_a} maps A/V samples to low-dimensional features y^v_i ∈ R^{M_v} and y^a_i ∈ R^{M_a}. Treating x_i and y_i as samples from a random variable, our goal is to choose h_v and h_a to maximize the mutual information, I(Y^a; Y^v), of the derived measurements. Mutual information indicates the amount of information that one random variable conveys on average about another. The usual difficulty of MI as a criterion for adaptation is that it is an integral function of probability densities. Furthermore, in general we are not given the densities themselves, but samples from which they must be inferred. We use a second-order entropy approximation with a nonparametric density estimator such that the gradient terms with respect to the projection coefficients can be computed exactly by evaluating a finite number of functions at a finite number of sample locations in the output space, as shown in [Fisher and Principe, 1997; Fisher and Principe, 1998].
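As a rough, self-contained illustration of scoring audio-visual consistency by the MI of projected features, the sketch below substitutes a simple histogram MI estimate on 1-D projections for the chapter's second-order entropy approximation and analytic gradient; the projections here are fixed rather than adapted, and all names and data are invented for the example. Normalizing by log2(bins) mirrors the idea of reporting MI relative to its maximum possible value.

import numpy as np

def mutual_information_hist(ya, yv, bins=16):
    """Histogram estimate of I(Ya; Yv) in bits for two 1-D feature sequences."""
    joint, _, _ = np.histogram2d(ya, yv, bins=bins)
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)
    pv = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (pa @ pv)[nz])))

def projected_mi(video_feats, audio_feats, hv, ha, bins=16):
    """Project each modality onto one direction and score their statistical dependence."""
    yv = video_feats @ hv
    ya = audio_feats @ ha
    return mutual_information_hist(ya, yv, bins) / np.log2(bins)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    n, nv, na = 600, 50, 30
    shared = rng.standard_normal(n)                      # common cause, playing the role of B
    video = np.outer(shared, rng.standard_normal(nv)) + 0.5 * rng.standard_normal((n, nv))
    audio = np.outer(shared, rng.standard_normal(na)) + 0.5 * rng.standard_normal((n, na))
    hv = rng.standard_normal(nv)
    ha = rng.standard_normal(na)
    print("normalized MI (matched pair):   %.2f" % projected_mi(video, audio, hv, ha))
    print("normalized MI (shuffled audio): %.2f" %
          projected_mi(video, audio[rng.permutation(n)], hv, ha))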
Figure 8.4. Video sequence contains one speaker and monitor which is flickering: (a) one image from the sequence, (b) pixel-wise image of standard deviations taken over the entire sequence, (c) image of the learned projection, hv , (d) image of hv for incorrect audio.
This method requires that the projection be differentiable, which it is in our case. Additionally, some form of capacity control is necessary, as the method results in a system of underdetermined equations. To address this problem, we impose an L2 penalty on the projection coefficients of h_a and h_v [Fisher and Darrell, 2002]. Furthermore, we impose the criterion that, if we consider the projection h_v as a filter, it has low output energy when convolved with images in the sequence (on average). This constraint is the same as that proposed by Mahalanobis et al. [1987] for designing optimized correlators, the difference being that in their case the projection output was designed explicitly, while in our case it is derived from the MI optimization in the output space. For the full details of this method, see [Fisher et al., 2000; Fisher and Darrell, 2002].
5.2 Single Microphone and Camera Experiments
Our motivating scenario for this application is a group of users interacting with an anonymous handheld device or kiosk using spoken commands. Given a received audio signal, we would like to verify whether the person speaking the command is in the field of view of the camera on the device, and if so to localize which person is speaking.
Simple techniques which check only for the presence of a face (or moving face) would fail when two people were looking at their individual devices and one spoke a command. Since interaction may be anonymous, we presume no prior model of the voice or appearance of users is available to perform the verification and localization.

We collected audio-video data from eight subjects. In all cases the video data was collected at 29.97 frames per second at a resolution of 360x240. The audio signal was collected at 48 kHz, but only 10 kHz of frequency content was used. All subjects were asked to utter the phrase “How’s the weather in Taipei?”. This typically yielded 2–2.5 seconds of data. Video frames were processed as is, while the audio signal was transformed to a series of periodograms. The window length of the periodogram was 2/29.97 seconds (i.e. spanning the width of two video frames). Upon estimating projections, the mutual information between the projected audio and video data samples is used as the measure of consistency. All values for mutual information are in terms of the maximum possible value, which is the value obtained (in the limit) if the two variables are uniformly distributed and perfectly predict one another.

In all cases, we assume that there is no significant head movement on the part of the speaker during the utterance of the sentence. While this assumption might be violated in practice, one might account for head movement using a tracking algorithm, in which case the algorithm as described would process the images after tracking.

Figure 8.4a shows a single video frame from one sequence of data. In the figure, there is a single speaker and a video monitor. Throughout the sequence the video monitor exhibits significant flicker. Figure 8.4b shows an image of the pixel-wise standard deviations of the image sequence. As can be seen, the energy associated with changes due to monitor flicker is greater than that due to the speaker. Figure 8.6a shows the associated periodogram sequence, where the horizontal axis is time and the vertical axis is frequency (0–10 kHz). Figure 8.4c shows the coefficients of the learned projection when fused with the audio signal. As can be seen, the projection highlights the region about the speaker’s lips.

Figure 8.5 shows results from another sequence in which there are two people. The person on the left was asked to utter the test phrase, while the person on the right moved their lips, but did not speak. This sequence is interesting in that a simple face detector would not be sufficient to disambiguate the audio and video stream. Figure 8.5b shows the pixel variance as before. There are significant changes about both subjects’ lips. Figure 8.5c shows the coefficients of the learned projection when the video is fused with the audio, and again the region about the correct speaker’s lips is highlighted.

Figure 8.5. Video sequence containing one speaker (person on left) and one person who is randomly moving their mouth/head (but not speaking): (a) one image from the sequence, (b) pixel-wise image of standard deviations taken over the entire sequence, (c) image of the learned projection, h_v, (d) image of h_v for incorrect audio.

In addition to localizing the audio source in the image sequence, we can also check for consistency between the audio and video. Such a test is useful in the case that the person to which a system is visually attending is not the person who actually spoke. Having learned a projection which optimizes MI in the output feature space, we can then estimate the resulting MI and use that estimate to quantify the audio/video consistency.

Using the sequences of Figures 8.4 and 8.5, we compared the fusion result when using a separately recorded audio sequence from another speaker. The periodogram of the alternate audio sequence is shown in Figure 8.6b. Figures 8.4d and 8.5d show the resulting h_v when the alternate audio sequence is used. In the case that the alternate audio was used, we see that coefficients related to the video monitor increase significantly in 8.4d, while energy is distributed throughout the image of 8.5d. For Figure 8.4, the estimate of mutual information was 0.68 relative to the maximum possible value for the correct audio sequence. In contrast, when compared to the periodogram of 8.6b, the value drops to 0.08 of maximum. For the sequence of Figure 8.5, the estimate of mutual information for the correct sequence was 0.61 relative to maximum, while it drops to 0.27 when the alternate audio is used.

Data was collected from six additional subjects for this experiment, and each video sequence was compared to each audio sequence. (No attempt was made to temporally align the mismatched audio sequences at a fine scale, but they were coarsely aligned.)
Figure 8.6. Gray scale magnitude of audio periodograms. Frequency increases from bottom to top, while time is from left to right. (a) audio signal for image sequence of figure 8.4. (b) alternate audio signal recorded from different subject.
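To make the audio preprocessing concrete, the sketch below (not the authors' code; the Hann window and the normalization are our own assumptions, with 29.97 fps video, 48 kHz audio, a window spanning two video frames, and a 10 kHz cutoff as stated above) computes one periodogram per video frame.

import numpy as np

def periodogram_sequence(audio, fs, fps=29.97, frames_per_window=2, max_freq=10000.0):
    """One magnitude-squared spectrum per video frame, over a window of two frames."""
    win_len = int(round(frames_per_window * fs / fps))
    hop = int(round(fs / fps))                         # one feature vector per video frame
    n_bins = int(max_freq / fs * win_len)              # keep content up to max_freq
    window = np.hanning(win_len)
    feats = []
    for start in range(0, len(audio) - win_len + 1, hop):
        seg = audio[start:start + win_len] * window
        spec = np.abs(np.fft.rfft(seg)) ** 2
        feats.append(spec[:n_bins])
    return np.asarray(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    fs = 48000
    audio = rng.standard_normal(int(2.5 * fs))         # stand-in for a ~2.5 s utterance
    P = periodogram_sequence(audio, fs)
    print("periodogram sequence shape (frames, bins):", P.shape)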
Table 8.3. Summary of results over eight video sequences. The columns indicate which audio sequence was used while the rows indicate which video sequence was used. In all cases the correct audio/video pair had the highest relative MI score.

        a1     a2     a3     a4     a5     a6     a7     a8
v1     0.68   0.19   0.12   0.05   0.19   0.11   0.12   0.05
v2     0.20   0.61   0.10   0.11   0.05   0.05   0.18   0.32
v3     0.05   0.27   0.55   0.05   0.05   0.05   0.05   0.05
v4     0.12   0.24   0.32   0.55   0.22   0.05   0.05   0.10
v5     0.17   0.05   0.05   0.05   0.55   0.05   0.20   0.09
v6     0.20   0.05   0.05   0.13   0.14   0.58   0.05   0.07
v7     0.18   0.15   0.07   0.05   0.05   0.05   0.64   0.26
v8     0.13   0.05   0.10   0.05   0.31   0.16   0.12   0.69
Table 8.3 summarizes the results. The previous sequences correspond to subjects 1 and 2 in the table. In every case the matching audio/video pairs exhibited the highest mutual information after estimating the projections.

Finally, we tested how this method can segregate the speech of multiple users in a single field of view. In concert with a face detection module, it is possible to detect which user is speaking and whether they are facing the camera. The audiovisual mutual information method is able to match the visual speech motion with the acoustic signal and to ignore confounding motions of the other user’s head or other motions in the scene. Figure 8.7 shows the results of tracking two users speaking in turns in front of a single camera and microphone, and detecting which user is most likely to be speaking based on the measured audiovisual consistency.
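The consistency test summarized in Table 8.3 can be sketched as follows: learn the projections for a given audio/video pair and report the estimated mutual information of the projected samples relative to its maximum. The histogram-based estimator and the bin count below are simplifying assumptions, and the variable names in the usage comment are hypothetical.

```python
import numpy as np

def normalized_mi(x, y, bins=8):
    """Histogram estimate of I(x; y), reported as a fraction of its maximum log(bins),
    which is reached when both variables are uniform over the bins and perfectly
    predict one another."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])) / np.log(bins))

# Usage (hypothetical variables): project each candidate audio/video pair with the
# projections learned for that pair, then compare the scores. A matched pair should
# score high (roughly 0.6-0.7 in Table 8.3) and a mismatched pair markedly lower.
# score_matched    = normalized_mi(video1 @ hv_11, audio1 @ ha_11)
# score_mismatched = normalized_mi(video1 @ hv_12, audio2 @ ha_12)
```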
Figure 8.7. Top row presents four frames from a video sequence with two speakers in front of a single camera and microphone. Audiovisual consistency is measured using a mutual information criterion. In the first two frames the left person is speaking, while in the last two the right person is speaking. The consistency measure shown in the bottom row for each frame correctly detects who is speaking.
6. Integration
We have shown separately how geometric and statistical approaches can be used to solve audiovisual segmentation tasks and enable untethered conversational interaction. The geometric approach used 3-D tracking and array processing and ignored appearance variation. The statistical approach used a mutual information analysis of appearance and spectral variation, and ignored 3-D geometry. While each approach is already valuable in its intended domain, it is clear that they would benefit from combination.

Our initial integration effort, described in detail in [Siracusa et al., 2003], combines geometric and statistical insights in a system for determining who is speaking in its environment. It uses multiple audio and video cues and allows for robust front-end processing in a multi-person conversational interface. In the system, data from a stereo camera and a small, two-element linear microphone array are processed to yield measurements of audio direction of arrival (DOA) and of audio-visual synchrony. Using this simple array allows us to use a standard PC audio card for input, but it is only able to recover DOA, in contrast to the full 3-D position information discussed in Section 3. In order to allow for real-time processing on a standard personal computer, we restrict the complexity of our statistical models. A head tracker obtains stabilized regions of interest for each person in the scene and determines their positions with respect to the camera. A simple measure of audio-visual synchrony is then calculated for each of these regions in a similar style to Hershey and Movellan [1999]. We combine the two cues (DOA and audio-visual synchrony) in a simple hypothesis testing framework, which yields better performance than either cue alone.
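A toy version of the cue-combination step might look like the following, where each visible person (plus an "off-camera" hypothesis) receives a log-likelihood from the DOA measurement and another from the audio-visual synchrony score, and the two are summed under a conditional-independence assumption. The likelihood values below are invented for illustration; this is not the model of [Siracusa et al., 2003].

```python
import numpy as np

def who_is_speaking(doa_loglik, sync_loglik, prior=None):
    """Combine two cues over hypotheses h = 0..K-1 ("visible person k is speaking")
    plus a final hypothesis ("the speaker is off camera / no visible speaker").
    doa_loglik, sync_loglik: length K+1 arrays of log-likelihoods of the observed
    audio DOA and audio-visual synchrony score under each hypothesis.
    Assuming the cues are conditionally independent given the hypothesis, the
    log-posterior is the sum of the two plus a log-prior."""
    n = len(doa_loglik)
    log_prior = np.log(np.full(n, 1.0 / n) if prior is None else np.asarray(prior))
    log_post = log_prior + np.asarray(doa_loglik) + np.asarray(sync_loglik)
    log_post -= np.logaddexp.reduce(log_post)        # normalize in log space
    return int(np.argmax(log_post)), np.exp(log_post)

# Example with two visible people and an off-camera hypothesis:
# DOA mildly favours person 1, but synchrony strongly favours person 0.
doa = np.log([0.30, 0.45, 0.25])
sync = np.log([0.70, 0.15, 0.15])
speaker, posterior = who_is_speaking(doa, sync)       # speaker == 0 here
```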
When multiple people interact with each other or with a conversational kiosk, there are no guarantees about where they are located. When people are well separated, audio DOA is sufficient to determine who is speaking, but when two or more people are standing close together, or are in unknown locations outside the field of view of the camera, audio DOA may be insufficient. Measuring audio-visual synchrony in such situations can disambiguate who is speaking by giving some indication of whose lips are in synchrony with the audio. When the speaker is not visible, there should be no audio-visual synchrony for any of the visible users, which gives a hint that the speaker is off camera.

The usefulness of our approach is shown in results for two kiosk-interaction scenarios. Each scenario involved two or three individuals conversing with each other or with the audio-visual rig. Two individuals were tracked and remained visible to the camera at all times. The first scenario involved only two people, who remained sufficiently separated throughout the recorded sequence. The second scenario involved three people, with the third person not visible to the camera and located fairly close to the second person, thus producing the ambiguous situation we wished to address. Each sequence consisted of over 2000 frames. Three versions of our system were tested on each sequence. The first version used only the DOA information. The second version ignored DOA and used audio-visual synchrony as its measurement, while the third version combined both.

Table 8.4. Sequence accuracy results: percentage of the frames in which the system correctly determined who was speaking.

Sequence         DOA Only    A/V Sync. Only    Combined
Two Person         89%            76%             89%
Three Person       80%            77%             89%
A summary of our results is presented in Table 8.4. Combining both measurements did not improve our system’s performance on the two-person sequence. This shows that audio DOA is a sufficient measurement for speaker localization when the sources are sufficiently separated. However, in the three-person case, adding a measure of audio-visual synchrony helps disambiguate who is speaking. In this sequence the combined system reduced the error by nearly a factor of two. Figure 8.8 shows results for two frames of the three-person sequence evaluated using the DOA-only system (top) and the combined system (bottom). Using DOA only, the system cannot disambiguate between the right person speaking and an unknown individual who is nearby. Adding an audiovisual synchrony measure clarifies the situation.
Figure 8.8. Results for two frames of the three person video sequence. In frame 1875, the person on the right is speaking. In frame 1936, someone off camera is speaking. (Top) Uses only DOA. The first row shows segmented head regions of interest for each tracked person. The second row shows audio DOA information in a plan-view plot of the estimated configuration. The position of each person is marked with an ’x’. The third row shows posterior probabilities for who is speaking. (Bottom) Uses DOA and audio-video synchrony. A visualization of the audio-video synchrony measurement is shown in the second row.
It is clear that when the right person is speaking there is a strong relationship between his mouth and the audio. Furthermore, when an off-camera source is speaking there is no audio-visual synchrony for those who are visible.

This work is an initial step toward producing robust conversational interfaces; it demonstrates that combining geometric and statistical approaches can improve audiovisual segmentation. We are currently exploring different real-time integration techniques involving efficient implementations of the more sophisticated statistical models described in Section 5.
References Becker, S. (1992). An Information-Theoretic Unsupervised Learning Algorithm for Neural Networks. PhD thesis, University of Toronto. Bub, U., Hunke, M., and Waibel, A. (1995). Knowing Who to Listen to in Speech Recognition: Visually Guided Beamforming. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 848–851. Casey, M., Gardner, W., and Basu, S. (1995). Vision Steered Beam-forming and Transaural Rendering for the Artificial Life Interactive Video Environment (ALIVE). In Proceedings of the 99th Convention of the Audio Engineering Society (AES). Preprint 4052. Checka, N., Wilson, K., Rangarajan, V., and Darrell, T. (2003). A Probabilistic Framework for Multi-modal Multi-person Tracking. In Proceedings of Workshop on Multi-Object Tracking. http://www.ai.mit.edu/projects/vip/papers/checka-et-al-womot.pdf. Collobert, M., Feraud, R., LeTourneur, G., Bernier, O., Viallet, J. E., Mahieux, Y., and Collobert, D. (1996). LISTEN: A System for Locating and Tracking Individual Speakers. In Proceedings of Second International Conference on Face and Gesture Recognition, pages 283–288. Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. New York: John Wiley & Sons, Inc. Darrell, T., Demirdjian, D., Checka, N., and Felzenszwalb, P. (2001). PlanView Trajectory Estimation with Dense Stereo Background Models. In Proceedings of International Conference on Computer Vision, volume 2, pages 628–635. Darrell, T., Gordon, G. G., Harville, M., and Woodfill, J. (2000). Integrated Person Tracking Using Stereo, Color, and Pattern Detection. International Journal of Computer Vision, 37(2):199–207. Deco, G. and Obradovic, D. (1996). An Information Theoretic Approach to Neural Computing. New York: Springer Verlag. Fisher, J. W. III and Darrell, T. (2002). Probabalistic Models and Informative Subspaces for Audiovisual Correspondence. In Heyden, A., Sparr, G.,
Nielsen, M., and Johansen, P., editors, Proceedings of the Seventh European Conference on Computer Vision (ECCV), volume 3, pages 592–603. Springer Lecture Notes in Computer Science 2352. Fisher, J. W. III, Darrell, T., Freeman, W. T., and Viola, P. (2000). Learning Joint Statistical Models for Audio-Visual Fusion and Segregation. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems (NIPS), volume 13, pages 1–7. MIT Press. Fisher, J. W. III and Principe, J. C. (1997). Entropy Manipulation of Arbitrary Nonlinear Mappings. In Principe, J. C., editor, Proceedings of IEEE Workshop on Neural Networks for Signal Processing VII, pages 14–23. Fisher, J. W. III and Principe, J. C. (1998). A Methodology for Information Theoretic Feature Extraction. In Stuberud, A., editor, Proceedings of the IEEE International Joint Conference on Neural Networks, volume 3, pages 1712–1716. Hershey, J. and Movellan, J. (1999). Using Audio-Visual Synchrony to Locate Sounds. In Solla, S. A., Leen, T. K., and Müller, K.-R., editors, Advances in Neural Information Processing Systems (NIPS), volume 12, pages 813–819. MIT Press. Ivanov, Y. A., Bobick, A. F., and Liu, J. (2000). Fast Lighting Independent Background Subtraction. International Journal of Computer Vision, 37(2): 199–207. Krumm, J., Harris, S., Meyers, B., Brummit, B., Hale, M., and Shafer, S. (2000). Multi-Camera Multi-Person Tracking for Easyliving. In Proceedings of the Third IEEE Workshop on Visual Surveillance, pages 3–10. Mahalanobis, A., Kumar, B., and Casasent, D. (1987). Minimum Average Correlation Energy Filters. Applied Optics, 26(17):3633–3640. Meier, U., Stiefelhagen, R., Yang, J., and Waibel, A. (1999). Towards Unrestricted Lipreading. In Proceedings of the Second International Conference on Multimodal Interfaces (ICMI), Hong Kong. Plumbley, M. (1991). On Information Theory and Unsupervised Neural Networks. Technical Report CUED/F-INFENG/TR. 78, Cambridge University Engineering Department, UK. Plumbley, M. and Fallside, S. (1988). An Information-Theoretic Approach to Unsupervised Connectionist Models. In Touretzky, D., Hinton, G., and Sejnowski, T., editors, Proceedings of the 1988 Connectionists Models Summer School, pages 239–245. San Mateo, CA: Morgan Kaufman. Silverman, H. F., Patterson, W. R., and Flanagan, J. L. (1998). The Huge Microphone Array. IEEE Concurrency, pages 36–46. Siracusa, M., Morency, L.-P., Wilson, K., Fisher, J., and Darrell, T. (2003). A Multi-Modal Approach for Determining Speaker Location and Focus. In Proceedings of the Fifth International Conference on Multimodal Interfaces
(ICMI), pages 77–80, Vancouver, Canada. http://www.ai.mit.edu/projects/vip/papers/Siracusa icmi2003.pdf. Slaney, M. and Covell, M. (2000). FaceSync: A linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems (NIPS), volume 13, pages 814–820. MIT Press. van Veen, B. D. and Buckley, K. M. (1988). Beamforming: A Versatile Approach to Spatial Filtering. IEEE Acoustics, Speech, and Signal Processing (ASSP) Magazine, 5(2):4–24. Viberg, M. and Krim, H. (1997). Two Decades of Statistical Array Processing. In Proceedings of the 31st Asilomar Conference on Signals, Systems, and Computers, volume 1, pages 775–777. Wang, C. and Brandstein, M. (1999). Multi-Source Face Tracking with Audio and Visual Data. In Proceedings of the IEEE International Workshop on Multimedia Signal Processing, pages 169–174, Copenhagen, Denmark. Wolff, G., Prasad, K. V., Stork, D. G., and Hennecke, M. (1994). Lipreading by Neural Networks: Visual Preprocessing, Learning and Sensory Integration. In Cowan, J., Tesauro, G., and Alspector, J., editors, Proceedings of Neural Information Processing Systems (NIPS-6), pages 1027–1034. Zue, V., Glass, J., Polifroni, J., Pao, C., Hazen, T., and Hetherington, L. (2000). Jupiter: A Telephone-Based Conversational Interface for Weather Information. IEEE Transactions on Speech and Audio Processing, 8(1):85–96.
PART III
ANIMATED TALKING HEADS AND EVALUATION
Chapter 9
THE PSYCHOLOGY AND TECHNOLOGY OF TALKING HEADS: APPLICATIONS IN LANGUAGE LEARNING

Dominic W. Massaro
Perceptual Science Laboratory, Department of Psychology, University of California, Santa Cruz, CA 95064, USA
[email protected]
Abstract
Given the value of visible speech, our persistent goal has been to develop, evaluate, and apply animated agents to produce accurate visible speech. The goal of our recent research has been to increase the number of agents and to improve the accuracy of visible speech. Perceptual tests indicated positive results of this work. Given this technology and the framework of the fuzzy logical model of perception (FLMP), we have developed computer-assisted speech and language tutors for deaf, hard of hearing, and autistic children. Baldi1, as the conversational agent, guides students through a variety of exercises designed to teach vocabulary and grammar, to improve speech articulation, and to develop linguistic and phonological awareness. The results indicate that the psychology and technology of Baldi hold great promise in language learning and speech therapy.
Keywords: Facial animation, visible speech, language learning, speech perception, vocabulary tutor, autism, children with language challenges.
1. Introduction
The face presents visual information during speech that is critically important for effective communication. While the auditory signal alone is adequate for communication, visual information from movements of the lips, tongue and jaws enhances intelligibility of the acoustic stimulus (particularly in noisy environments).
1 Baldi® is a registered trademark of Dominic W. Massaro.
Moreover, speech is enriched by the facial expressions, emotions and gestures produced by a speaker [Massaro, 1998]. The visual components of speech offer a lifeline to those with severe or profound hearing loss. Even for individuals who hear well, these visible aspects of speech are especially important in noisy environments. For individuals with severe or profound hearing loss, understanding visible speech can make the difference between communicating orally with others effectively and a life of relative isolation from oral society [Trychin, 1997].

Our persistent goal has been to develop, evaluate, and apply animated agents to produce accurate visible speech. These agents have a tremendous potential to benefit virtually all individuals, but especially those with hearing problems (> 28,000,000 in the USA), including the millions of people who acquire age-related hearing loss every year (http://www.nidcd.nih.gov/health/hb.htm), and for whom visible speech takes on increasing importance. One of many applications of animated characters is training individuals with hearing loss to "read" visible speech, thus facilitating face-to-face oral communication in all situations (educational, social, work-related, etc.). These enhanced characters can also function effectively as language tutors, reading tutors, or personal agents in human-machine interaction.

For the past ten years, my colleagues and I have been improving the accuracy of visible speech produced by an animated talking face - Baldi (Figure 9.1; [Massaro, 1998, chapters 12-14]). Baldi has been used effectively to teach vocabulary to profoundly deaf children at the Tucker-Maxon Oral School in a project funded by an NSF Challenge Grant [Barker, 2003; Massaro, 2000]. The same pedagogy and technology have been employed for language learning with autistic children [Massaro et al., 2003]. While Baldi's visible speech and tongue model probably represent the best of the state of the art in real-time visible speech synthesis by a talking face, experiments have shown that Baldi's visible speech is not as effective as that of human faces. Preliminary observations strongly suggest that the specific segmental and prosodic characteristics are not defined optimally. One of our goals, therefore, is to significantly improve the communicative effectiveness of synthetic visual speech.
2. Facial Animation and Visible Speech Synthesis
Visible speech synthesis is a sub-field of the general areas of speech synthesis and computer facial animation (Chapter 12 in [Massaro, 1998] organizes the representative work that has been done in this area). The goal of the visible speech synthesis in the Perceptual Science Laboratory (PSL) has been to develop a polygon (wireframe) model with realistic motions (but not to duplicate the musculature of the face to control this mask). We call this technique terminal analogue synthesis because its goal is to simply use the final speech
Figure 9.1. Baldi, a computer-animated talking head, in normal and wireframe presentations, and a close-up of the tongue.
product to control the facial articulation of speech (rather than illustrate the physiological mechanisms that produce it). This method of rendering visible speech synthesis has also proven most successful with audible speech synthesis. One advantage of terminal analogue synthesis is that calculations of the changing surface shapes in the polygon models can be carried out much faster than those for muscle and tissue simulations. For example, our software can generate a talking face in real time on a commodity PC, whereas muscle and tissue simulations are usually too computationally intensive to perform in real time [Massaro, 1998]. More recently, image synthesis, which joins together images of a real speaker, has been gaining in popularity because of the realism that it provides. These systems are also not capable of real-time synthesis because of their computational intensity. Finally, performance-based synthesis, e.g., [Guenter et al., 1998], does not have the flexibility of saying anything at any time in real time, as does our text-to-speech system.

Our own current software [Cohen and Massaro, 1993; Cohen et al., 1996; Cohen et al., 1998; Massaro, 1998] is a descendant of Parke's software and his particular 3-D talking head [Parke, 1975]. Our modifications over the last 8 years have included increased resolution of the model, additional and modified control parameters, three generations of a tongue (which was lacking in Parke's model), a new visual speech synthesis coarticulatory control strategy, controls for paralinguistic information and affect in the face, alignment with natural speech, text-to-speech synthesis, and bimodal (auditory/visual) synthesis. Most of our current parameters move vertices (and the polygons formed from these vertices) on the face by geometric functions such as rotation (e.g. jaw rotation) or translation of the vertices in one or more dimensions (e.g., lower and upper lip height, mouth widening). Other parameters work
by scaling and interpolating different face subareas. Many of the face shape parameters, such as cheek, neck, or forehead shape, and also some affect parameters such as smiling, use interpolation.

We have used phonemes as the basic unit of speech synthesis. In this scheme, any utterance can be represented as a string of successive phonemes, and each phoneme is represented as a set of target values for the control parameters such as jaw rotation, mouth width, etc. Because speech production is a continuous process involving movements of different articulators (e.g., tongue, lips, jaw) having mass and inertia, phoneme utterances are influenced by the context in which they occur, a process called coarticulation. In our visual speech synthesis algorithm [Cohen and Massaro, 1993; Massaro, 1998, chapter 12], coarticulation is based on a model of speech production using rules that describe the relative dominance of the characteristics of the speech segments. In our model, each segment is specified by a target value for each facial control parameter. For each control parameter of a speech segment, there are also temporal dominance functions dictating the influence of that segment over the control parameter. These dominance functions determine independently for each control parameter how much weight its target value carries against those of neighbouring segments, which in turn determines the final control values.

Our animated face can be aligned with either the output of a speech synthesizer or natural auditory speech [Massaro et al., 2005]. We have also developed the phoneme set and the corresponding target and coarticulation values to allow synthesis of several other languages. These include Spanish (Baldero), Italian (Baldini, [Cosi et al., 2002]), Mandarin (Bao), Arabic (Badr, [Ouni et al., 2003]), French (Balduin) and German (Balthasar). Baldi can be seen at http://mambo.ucsc.edu.

Baldi's synthetic tongue is constructed of a polygon surface defined by sagittal and coronal b-spline curves (see Figure 9.1). The control points of these b-spline curves are controlled singly and in pairs by speech articulation control parameters. There are now 9 sagittal and 3 × 7 coronal parameters that are modified to mimic natural tongue movements. The tongue, teeth, and palate interactions during speaking require an algorithm to prevent the tongue from passing through the teeth and palate rather than colliding with them. To ensure this, we have developed a fast collision detection method to instantiate the appropriate interactions. Two sets of observations of real talkers have been used to inform the appropriate movements of the tongue. These include 1) three-dimensional ultrasound measurements of upper tongue surfaces and 2) electropalatography (EPG) data collected from a natural talker using a plastic palate insert that incorporates a grid of about a hundred electrodes that detect contact between the tongue and palate at a fast rate (e.g. a full set of measurements 100 times per second). These measurements were made in collaboration with Maureen Stone at Johns Hopkins University.
Figure 9.2. High-resolution texture (left) and a high-resolution polygon mesh (right) obtained from a Cyberware laser scan.
Minimization and optimization routines are used to create animated tongue movements that mimic the observed tongue movements [Cohen et al., 1998].
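To make the coarticulation scheme described above concrete, the following sketch blends per-segment targets for a single control parameter using dominance functions that decay exponentially away from each segment's centre, with separate attack and decay rates. The functional form and all numerical values are illustrative assumptions, not the published segment definitions.

```python
import numpy as np

def dominance(t, center, alpha, theta_attack, theta_decay, c=1.0):
    """Dominance of one segment at time t: an exponential fall-off from the segment
    centre, with separate rates before (attack) and after (decay) the centre."""
    tau = t - center
    theta = np.where(tau < 0, theta_attack, theta_decay)
    return alpha * np.exp(-theta * np.abs(tau) ** c)

def parameter_track(times, segments):
    """segments: list of dicts with keys 'center', 'target', 'alpha', 'attack', 'decay'.
    Returns the blended control-parameter value at each time: a dominance-weighted
    average of the segment targets, so neighbouring segments coarticulate."""
    num = np.zeros_like(times, dtype=float)
    den = np.zeros_like(times, dtype=float)
    for s in segments:
        d = dominance(times, s['center'], s['alpha'], s['attack'], s['decay'])
        num += d * s['target']
        den += d
    return num / den

# Illustrative use for one parameter (say, lip rounding) over three segments:
t = np.linspace(0.0, 0.6, 200)
segs = [
    {'center': 0.10, 'target': 0.2, 'alpha': 1.0, 'attack': 20.0, 'decay': 20.0},
    {'center': 0.30, 'target': 0.9, 'alpha': 1.0, 'attack': 15.0, 'decay': 15.0},
    {'center': 0.50, 'target': 0.3, 'alpha': 1.0, 'attack': 20.0, 'decay': 20.0},
]
track = parameter_track(t, segs)   # the middle segment's target bleeds into its neighbours
```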
2.1 Recent Progress in Visible Speech Synthesis
Important goals for the application of talking heads are to have a large gallery of possible agents and to have highly intelligible and realistic synthetic visible speech. Our development of visible speech synthesis is based on facial animation of a single canonical face, called Baldi (see Figure 9.1; [Massaro, 1998]). Although the synthesis, parameter control, coarticulation scheme, and rendering engine are specific to Baldi, we have developed software to reshape our canonical face to match various target facial models. To achieve realistic and accurate synthesis, we use measurements of facial, lip, and tongue movements during speech production to optimize both the static and dynamic accuracy of the visible speech. This optimization process is called minimization because we seek to minimize the error between the empirical observations of real human speech and the speech produced by our synthetic talker [Cohen et al., 1998; Cohen et al., 2001; Cohen et al., 2002].
2.2 Improving the Static Model
A Cyberware 3D laser scanning system is used to enroll new citizens in our gallery of talking heads. To illustrate this procedure, we describe how a Cyberware laser scan of DWM was made, how Baldi's generic morphology was mapped into the form of DWM, how this head was trained on real data, and how the quality of its speech was evaluated. A laser scan of a new target head produces a very high polygon count representation. Figure 9.2 shows a high-resolution texture-mapped Cyberware scan of DWM and the accompanying high-resolution mesh.
Figure 9.3. Laser scan of high resolution head and canonical Baldi low resolution head with alignment points.
Rather than trying to animate this high-resolution head (which is impossible to do in real time with current hardware), our software uses these data to reshape our canonical head to take on the shape of the new target head. In this approach, a human operator marks corresponding facial landmarks on both the laser scan head and the generic Baldi head (Figure 9.3). Our canonical head is then warped until it assumes as closely as possible the shape of the target head, with the additional constraint that the landmarks of the canonical face move to positions corresponding to those on the target face.

This morphing algorithm is based on the work of Kent, Carlson, and Parent [1992]. In this approach, all the triangles making up the source and target models are projected onto a unit sphere centred at the origin. The models must be convex or star-shaped so that there is at least one point within the model from which all vertices of all triangles are visible. This can be confirmed by a separate vertex visibility test procedure that checks for this requirement. If a model is non-convex or non-star-shaped, then it may be necessary to ignore or modify these sections of the model. In order to meet this requirement, portions of the ears, eyes, and lips are handled separately from the rest of Baldi's head.

For the main portion of the head, we first translate all vertices so that the centre point of the model coincides with the coordinate system origin. We then move the vertices so that they are at a unit distance from the origin. At this point, the vertices of the triangles making up the model are on the surface of the unit sphere. This is done to both Baldi's source head and the Cyberware laser scan target head. The landmarks are then connected into a mesh of their own.
As these landmarks are moved into their new positions, the non-landmark points contained in triangles defined by the landmark points are moved to keep their relative positions within the landmark triangles. Then, for each of these source vertices, we determine the location on the target model to which a given source vertex projects. This gives us a homeomorphic mapping (one-to-one and onto) between source and target datasets, and we can thereby determine the morph coordinate of each source vertex as a barycentric coordinate of the target triangle to which it maps. This mapping guides the final morph between the source and target datasets.

A different technique is used to interpolate polygon patches which were earlier culled out of the target model on account of being non-convex. These patches are instead stretched to fit the new boundaries of the culled regions in the morphed head. Because this technique does not capture as much of the target shape's detail as our main method of interpolation, we try to minimize the size of the patches that are culled in this manner. To output the final topology, the program then reconnects all the source polygonal patches and outputs them in a single topology file. The source connectivity is not disturbed and is the same as the original source connectivity.
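The core of this source-to-target mapping can be sketched as follows, assuming both meshes are already centred and star-shaped with respect to the origin: each source vertex is pushed onto the unit sphere, the ray through it is intersected with the target triangles, and the barycentric coordinates of the hit define the morph correspondence. This is a simplified, brute-force illustration, not the production morphing code.

```python
import numpy as np

def to_unit_sphere(vertices):
    """Project (already centred) vertices radially onto the unit sphere."""
    return vertices / np.linalg.norm(vertices, axis=1, keepdims=True)

def ray_triangle_barycentric(d, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore intersection of the ray (origin, direction d) with triangle
    (v0, v1, v2). Returns barycentric coordinates (b0, b1, b2) or None on a miss."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(d, e2)
    det = e1 @ p
    if abs(det) < eps:
        return None
    inv = 1.0 / det
    t_vec = -v0                                  # ray origin (0,0,0) minus v0
    u = (t_vec @ p) * inv
    q = np.cross(t_vec, e1)
    v = (d @ q) * inv
    t = (e2 @ q) * inv
    if u < -eps or v < -eps or u + v > 1 + eps or t <= 0:
        return None
    return np.array([1 - u - v, u, v])

def map_source_to_target(source_vertices, target_vertices, target_triangles):
    """For each source vertex, find the target triangle its unit-sphere direction
    falls in and record (triangle index, barycentric coords); these correspondences
    drive the final morph. Brute force over triangles for clarity."""
    mapping = []
    for d in to_unit_sphere(source_vertices):
        for ti, (i, j, k) in enumerate(target_triangles):
            b = ray_triangle_barycentric(d, target_vertices[i],
                                         target_vertices[j], target_vertices[k])
            if b is not None:
                mapping.append((ti, b))
                break
    return mapping
```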
2.3 Improving the Dynamic Model
To improve the intelligibility of our talking heads, we have developed software that uses dynamic 3D optical measurements (Optotrak) of points on a real face during speech. In one study, we recorded a large speech database with 19 markers affixed to the face of DWM at important locations [Cohen et al., 2002]. Fitting of these dynamic data occurred in several stages. To begin, we assigned points on the surface of the synthetic model that best correspond to the Optotrak measurement points. In the training, the Optotrak data were adjusted in rotation, translation, and scale to best match the corresponding points marked on the synthetic face. The data collected for the training consisted of 100 CID sentences recorded by DWM speaking in a fairly natural manner.

In the first stage fit, for each time frame (30 fps) we automatically and iteratively adjusted 11 facial control parameters of the face to get the best fit (the least sum of squared distances) between the Optotrak measurements and the corresponding point locations on the synthetic face. In the second stage fit, the goal was to tune the segment definitions (parameter targets, dominance function strengths, attack and decay rates, and peak strength time offsets) used in our coarticulation algorithm [Cohen and Massaro, 1993] to get the best fit with the parameter tracks obtained in the first stage fit. We first used Viterbi alignment on the acoustic speech data of each sentence to obtain the phoneme durations used to synthesize each sentence.
Figure 9.4. Proportion of words correct as a function of the initial consonant of all words in the test sentences for the auditory alone, synthetic face, and real face conditions.
Given the phonemes and durations, we used our standard parametric phoneme synthesis and coarticulation algorithm to synthesize the parameter tracks for all 100 CID sentences. These were compared with the parameter tracks obtained from the first stage fit, the error computed, and the parameters adjusted until the best fit was achieved.
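A schematic of the first-stage, per-frame fit is given below. It assumes a forward function `synth_points(params)` that returns the synthetic face's 19 marker-correspondent points for a given 11-element control vector; SciPy's generic least-squares routine stands in for whatever optimizer was actually used.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_frame(observed_points, synth_points, init_params):
    """observed_points: (19, 3) Optotrak marker positions for one video frame.
    synth_points(params): hypothetical forward model returning the (19, 3) positions
    of the corresponding points on the synthetic face for an 11-vector of controls.
    Minimizes the sum of squared distances between measured and synthesized points."""
    def residuals(params):
        return (synth_points(params) - observed_points).ravel()
    return least_squares(residuals, init_params).x

def fit_sequence(frames, synth_points, n_params=11):
    """Fit each 30 fps frame in turn, warm-starting from the previous frame's
    solution; the result is the per-parameter track used by the stage-2 fit."""
    params = np.zeros(n_params)
    tracks = []
    for obs in frames:
        params = fit_frame(obs, synth_points, params)
        tracks.append(params)
    return np.array(tracks)          # shape (n_frames, 11)
```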
2.4 Perceptual Evaluation
We carried out a perceptual recognition experiment with human subjects to evaluate how well this improved synthetic talker conveyed speech information relative to the real talker. To do this we presented the 100 CID sentences in three conditions: auditory alone, auditory + synthetic talker, and auditory + real talker. In all cases white (speech band) noise was added to the audio channel. Each of the 100 CID sentences was presented in each of the three modalities for a total of 300 trials. Each trial began with the presentation of the sentence, and subjects then typed in as many words as they could recognize. Students in an introductory psychology course served as subjects.

Figure 9.4 shows the proportion of correct words reported as a function of the initial consonant under the three presentation conditions. There was a significant advantage of having the visible speech, and the advantage of the synthetic head was equivalent to that of the original video of the real face. Overall, the proportion of correctly reported words was 0.22 for auditory alone, 0.43 with the synthetic face, and 0.42 with the real face.

The results of the current evaluation study, using the stage 1 best-fitting parameters, are encouraging. In studies to follow, we will be comparing performance
with visual TTS synthesis based on the segment definitions from the stage 2 fits, for single segments, for context-sensitive segments, and for concatenation of diphone-sized chunks from the stage 1 fits. In addition, we will be using a higher resolution canonical head with many additional polygons and an improved texture map.
3. Speech Science
Speech science evolved as the study of a unimodal phenomenon. Speech was viewed as a solely auditory event, as captured by the seminal speech-chain illustration shown in [Denes and Pinson, 1963]. This view is no longer viable, as witnessed by a burgeoning record of research findings. Speech as a multimodal phenomenon is supported by experiments indicating that our perception and understanding are influenced by a speaker's face and accompanying gestures, as well as the actual sound of the speech.

Many communication environments involve a noisy auditory channel, which degrades speech perception and recognition. Visible speech from the talker's face improves intelligibility in these situations. Visible speech is also an important communication channel for individuals with hearing loss and others with specific deficits in processing auditory information. The number of words understood from a degraded auditory message can often be doubled by pairing the message with visible speech from the talker's face [Jesse et al., 2001]. The combination of auditory and visual speech has been called super-additive because their combination can lead to accuracy that is much greater than accuracy on either modality alone.

Furthermore, the strong influence of visible speech is not limited to situations with degraded auditory input. A perceiver's recognition of an auditory-visual syllable reflects the contribution of both sound and sight. For example, if the ambiguous auditory sentence, My bab pop me poo brive, is paired with the visible sentence, My gag kok me koo grive, the perceiver is likely to hear, My dad taught me to drive. Two ambiguous sources of information are combined to create a meaningful interpretation [Massaro, 1998].
3.1 Value of Multimodal Speech
There are several reasons why the use of auditory and visual information together in face-to-face interactions is so successful. These include a) the robustness of visual speech, b) the complementarity of auditory and visual speech, and c) the optimal integration of these two sources of information. Speech reading, or the ability to obtain speech information from the face, is robust in that perceivers are fairly good at speech reading even when they are not looking directly at the talker's lips. Furthermore, accuracy is not dramatically reduced when the facial image is blurred (because of poor vision, for example), when the face
is viewed from above, below, or in profile, or when there is a large distance between the talker and the viewer [Massaro, 1998, Chapter 14].

Complementarity of auditory and visual information simply means that one of the sources is strong when the other is weak. A distinction between two segments robustly conveyed in one modality is relatively ambiguous in the other modality. For example, the place difference between /ba/ and /da/ is easy to see but relatively difficult to hear. On the other hand, the voicing difference between /ba/ and /pa/ is relatively easy to hear but very difficult to discriminate visually. Two complementary sources of information make their combined use much more informative than would be the case if the two sources were noncomplementary, or redundant [Massaro, 1998, pages 424–427].

The final reason is that perceivers combine or integrate the auditory and visual sources of information in an optimally efficient manner. It might seem obvious, given the joint influence of audible and visible speech, that these two sources of information are combined or integrated. Integration is not the only process that can account for an advantage of two sources of information relative to just one, however. There are many possible ways to treat two sources of information: use only the most informative source; use only the auditory source if it can be identified and, if it cannot, use the visible source; average the two sources together; or integrate them in such a fashion that both sources are used but the least ambiguous source has the most influence. Research has shown that perceivers do in fact integrate the information available from each modality to perform as efficiently as possible. We now describe a model that predicts this optimally efficient process of combination [Massaro, 1998].
3.2 Fuzzy Logical Model of Perception
The fuzzy logical model of perception (FLMP), shown in Figure 9.5, assumes necessarily successive but overlapping stages of processing. The perceiver of speech is viewed as having multiple sources of information supporting the identification and interpretation of the language input. The model assumes that 1) each source of information is evaluated to give the continuous degree to which that source supports various alternatives, 2) the sources of information are evaluated independently of one another, 3) the sources are integrated to provide an overall degree of support for each alternative, and 4) perceptual identification and interpretation follow the relative degree of support among the alternatives [Massaro et al., 2001; Massaro, 2002].

The paradigm that we have developed permits us to determine how visible speech is processed and integrated with other sources of information. The results also inform us about which of the many potentially functional cues are actually used by human observers [Massaro, 1987, Chapter 1]. The systematic variation of properties of the speech signal combined with the quantitative test of models of speech perception enables the investigator to test the psychological validity of different cues.
Figure 9.5. Schematic representation of the three processes involved in perceptual recognition. The three processes are shown to proceed left to right in time to illustrate their necessarily successive but overlapping processing. These processes make use of prototypes stored in long-term memory. The sources of information are represented by uppercase letters. Auditory information is represented by Ai and visual information by Vj. The evaluation process transforms these sources of information into psychological values (indicated by lowercase letters ai and vj). These sources are then integrated to give an overall degree of support, sk, for each speech alternative k. The decision operation maps the outputs of integration into some response alternative, Rk. The response can take the form of a discrete decision or a rating of the degree to which the alternative is likely.
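In its standard form, the FLMP's integration and decision operations amount to multiplying the auditory and visual truth values for each alternative and normalizing by the total support (the relative goodness rule). A minimal sketch, with made-up truth values for illustration:

```python
import numpy as np

def flmp(auditory_support, visual_support):
    """auditory_support, visual_support: fuzzy truth values in [0, 1], one per
    response alternative (the outputs of the evaluation stage).
    Integration multiplies the supports; decision divides each product by the
    sum over alternatives, giving the predicted probability of each response."""
    s = np.asarray(auditory_support) * np.asarray(visual_support)
    return s / s.sum()

# Example with two alternatives, /ba/ and /da/:
# the audio weakly favours /ba/, while the face strongly favours /da/.
a = np.array([0.6, 0.4])   # a_i: auditory support for /ba/, /da/
v = np.array([0.1, 0.9])   # v_j: visual support for /ba/, /da/
p = flmp(a, v)             # -> [0.143, 0.857]: the less ambiguous source dominates
```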
This paradigm has already proven to be effective in the study of audible, visible, and bimodal speech perception [Massaro, 1987; Massaro, 1998]. Thus, our research strategy not only addresses how different sources of information are evaluated and integrated, but can also uncover what sources of information are actually used. We believe that the research paradigm confronts both the important psychophysical question of the nature of information and the process question of how the information is transformed and mapped into behaviour. Many independent tests point to the viability of the FLMP as a general description of pattern recognition. The FLMP is centred around a universal law of how people integrate multiple sources of information. This law and its relationship to other laws are developed in detail in [Massaro, 1998].

The FLMP is also valuable because it motivates our approach to language learning. Baldi can display a midsagittal view, or the skin on the face can be made transparent to reveal the internal articulators. The orientation of the face can be changed to display different viewpoints while speaking, such as a side view, or a view from the back of the head [Massaro, 1999; Massaro, 2000]. The auditory and visual speech can also be independently controlled and manipulated, permitting customized enhancements of the informative characteristics of speech. These features offer novel approaches in language training, permitting one to pedagogically illustrate appropriate articulations that are usually
hidden by the face. This technology has the potential to help individuals with language delays and deficits, and we have been utilizing Baldi to carry out language tutoring with hard of hearing children and children with autism.
4. Language Learning
As with most issues in social science, there is no consensus on the best way to teach or to learn language. There is agreement, however, about the importance of time on task; learning and retention are positively correlated with the time spent learning. Our technology offers a platform for unlimited instruction, which can be initiated whenever and wherever the child and/or supervisor chooses. Baldi and the accompanying lessons are perpetual. Take, for example, children with autism, who have irregular sleep patterns. A child could conceivably wake in the middle of the night and participate in language learning with Baldi as his or her friendly guide.

Several advantages of utilizing a computer-animated agent as a language tutor are clear, including the popularity of computers and embodied conversational agents with children with autism. A second advantage is the availability of the program. Instruction is always available to the child, 24 hours a day, 365 days a year. Furthermore, instruction occurs in a one-on-one learning environment for the students. We have found that the students enjoy working with Baldi because he offers extreme patience, he doesn't become angry, tired, or bored, and he is in effect a perpetual teaching machine.
4.1 Vocabulary Learning
Vocabulary knowledge is critically important for understanding the world and for language competence, in both spoken language and in reading. There is empirical evidence that very young children more easily form conceptual categories when category labels are available than when they are not [Waxman, 2002]. There is also evidence that there is a sudden increase in the rate at which new words are learned once the child knows about 150 words. Grammatical skill also emerges at this time. Even children experiencing language delays because of specific language impairment benefit once this level of word knowledge is obtained. It follows that increasing the pervasiveness and effectiveness of vocabulary learning offers a huge opportunity for improving conceptual knowledge and language competence for all individuals, whether or not they are disadvantaged because of sensory limitations, learning disabilities, or social condition. Finally, it is well-known that vocabulary knowledge is positively correlated with both listening and reading comprehension [Anderson and Freebody, 1981].

Our Language Tutor, Baldi, encompasses and instantiates the developments in the pedagogy of how language is learned, remembered and used.
Education research has shown that children can be taught new word meanings by using direct instruction, e.g., [McKeown et al., 1985; Stahl, 1986]. It has also been convincingly demonstrated that direct teaching of vocabulary by computer software is possible, and that an interactive multimedia environment is ideally suited for this learning [Berninger and Richards, 2002; Wood, 2001]. As cogently observed by Wood [2001], "Products that emphasize multimodal learning, often by combining many of the features discussed above, perhaps make the greatest contribution to dynamic vocabulary learning. Multimodal features not only help keep children actively engaged in their own learning, but also accommodate a range of learning styles by offering several entry points: When children can see new words in context, hear them pronounced, type them into a journal, and cut and paste an accompanying illustration (or create their own), the potential for learning can be dramatically increased."

Following this logic, many aspects of our lessons enhance and reinforce learning. For example, the existing program and planned modifications make it possible for the student to 1) observe the words being spoken by a realistic talking interlocutor (Baldi), 2) see the word as written as well as spoken, 3) see visual images of referents of the words or view an animation of a meaningful scene, 4) click on or point to the referent, 5) hear himself or herself say the word, 6) spell the word by typing, 7) observe the word used in context, and 8) incorporate the word into his or her own speech act. Other benefits of our program include the ability to seamlessly meld spoken and written language, to provide a semblance of a game-playing experience while actually learning, and to lead the child along a growth path that always bridges his or her current "zone of proximal development."
4.2 Description of Language Wizard and Tutor
The Language Tutor and Wizard is a user-friendly application that allows users with minimal computer experience to compose and present lessons [Bosseler and Massaro, 2003; Barker, 2003].2 The lessons encompass and instantiate the developments in the pedagogy of how language is learned, remembered and used. Figure 9.6 shows a view of the screen from the Presentation exercise in a prototypical lesson. In this lesson, the students learn to identify vegetables. In this exercise, text corresponding to each item is presented when the item is tutored. An outlined region around the zucchini designates the selected object. Emoticon "stickers" (not shown) can also be used as feedback for the responses.
2 The development of this application was carried out in collaboration with the Center for Spoken Language Understanding at the Oregon Health Sciences University and the Tucker Maxon Oral School, both in Portland, Oregon.
Figure 9.6. A prototypical lesson illustrating the format of the Language Tutor. Each lesson contains Baldi, the vocabulary items and written text and captioning (optional, not shown), and emoticons (not shown). In this application the students learn to identify vegetables. For example, Baldi says "this is a zucchini" in the Presentation exercise.
All of the exercises required the children to respond to spoken directives such as "click on the little chair" or "find the red fox". These images were associated with the corresponding spoken vocabulary words (see [Bosseler and Massaro, 2003] for vocabulary examples). The items became highlighted whenever the mouse passed over their region. The student selected his or her response by clicking the mouse on one of the designated areas.

The Language Wizard consists of 8 different exercises: pre-test, presentation, recognition, reading, spelling, imitation, elicitation, and post-test. The Wizard is equipped with easily changeable default settings that determine what Baldi says and how he says it, the oral feedback and emoticons given for responses, the number of attempts permitted for the student in each exercise, and the number of times each item is presented. The program automatically creates and writes all student performance information to a log file stored in the student's directory.
5. Research on the Educational Impact of Animated Tutors
Research has shown that this pedagogical and technological program is highly effective for both children with hearing loss and children with autism. Processing information presented via the visual modality reinforces learning [Courchesne et al., 1994] and is consistent with the TEACCH [Schopler et al., 1995] suggestion for the use of visually presented material. These children tend to have major difficulties in acquiring language, and they serve as particularly challenging tests for the effectiveness of our pedagogy. We now describe some recent research carried out to evaluate our animated tutor in teaching both children with hearing loss [Barker, 2003; Massaro et al., 2003] and children with autism [Bosseler and Massaro, 2003].
5.1 Improving the Vocabulary of Hard of Hearing Children
It is well-known that hard of hearing children have significant deficits in vocabulary knowledge. In many cases, the children do not have names for specific things and concepts. These children often communicate with phrases such as "the window in the front of the car," "the big shelf where the sink is," or "the step by the street" rather than "windshield," "counter," or "curb" [Barker, 2003, citing Pat Stone].

The Language Tutor has been in use at the Tucker Maxon Oral School in Portland, Oregon, and Barker [2003] evaluated its effectiveness. Students were given cameras to photograph objects at home and in their surroundings. The pictures of these objects were then incorporated as items in the lessons. A given lesson had between 10 and 15 items. Students worked on the items about 10 minutes a day until they reached 100% on the post-test. They then moved on to another lesson. About one month after each successful (100%) post-test, they were retested on the same items. Ten girls and nine boys from the "upper school" and the "lower school" participated in the applications. There were six hard of hearing children and one hearing child between 8 and 10 years of age in the lower school. Ten hard of hearing and two hearing children, between 11 and 14 years of age, participated from the upper school.

Similar results were found for both age groups. Students knew about one-half of the items without any learning, they successfully learned the other half of the items, and they retained about one-half of the newly learned items when retested 30 days later. These results demonstrate the effectiveness of the Language Tutor for learning and retaining new vocabulary.

The results of the Barker evaluation [Barker, 2003] indicated that hard of hearing children learned a significant number of new words and retained about half of them a month after training ended. No control groups were used in that evaluation, however, and it is possible the children were learning the words
outside of the Language Tutor environment. Furthermore, the time course of learning with the Language Tutor was not evaluated. It is of interest how quickly words can be learned with the Language Tutor to give some idea of how this learning environment would compare to a real teacher. Finally, both identification and production of the words should be assessed given that only identification was measured previously.
5.2 Testing the Validity of the Vocabulary Tutor
To address these issues, Massaro and Light [2004a] carried out an experiment based on a within-student multiple baseline design [Baer et al., 1968], in which certain words were continuously being tested while other words were being tested and trained. Although the students' instructors and speech therapists agreed not to teach or use these words during our investigation, it is still possible that the words could be learned outside of the Language Tutor environment. The single-student multiple baseline design monitors this possibility by providing a continuous measure of the knowledge of words that are not being trained. Thus, any significant differences in performance on the trained and untrained words can be attributed to the Language Tutor training program itself rather than to some other factor.

Eight hard of hearing children, two males ages 6 and 7 and six females ages 9 and 10, were recruited from The Jackson Hearing Center in Los Altos, California, with parental consent to participate. The male students were in grade 1 and the female students in grade 4, and all students needed help with their vocabulary-building skills as suggested by their regular day teachers. One child had a cochlear implant, and the seven other children had hearing aids in both ears, except for one child with an aid in just one ear.

Using the Language Wizard, the experimenter developed a set of lessons with a collection of vocabulary items that was individually tailored for each student. Each collection comprised 24 items, broken down into 3 categories of 8 items each. Three lessons with 8 items each were made for each child. Images of the vocabulary items were presented on the screen next to Baldi as he spoke, as illustrated in Figure 9.6. Some of the exercises required the child to respond to Baldi's instructions, such as "click on the cabbage" or "show me the yam", by clicking on the highlighted area or by moving the computer mouse over the appropriate image until an item was highlighted and then clicking on it. Two other exercises asked the child to recognize the written word and to type the word, respectively. The production exercises asked the child to repeat after Baldi once he named the highlighted image or to name the highlighted image on their own, followed by Baldi's naming of the image.

Figure 9.7 gives the results of identification and production for one of the eight students. The results were highly consistent across the eight students.
[Figure 9.7: three panels (Set 1, Set 2, Set 3) plotting proportion correct against session number, spanning pretraining, training, and posttraining phases, with separate curves for identification and production.]
Figure 9.7. Proportion of correctly identified (solid black diamonds) and correctly produced (empty white squares) items across the testing sessions for student 1. The training on a set of words occurred between the two vertical bars. The figure illustrates that once training was implemented identification performance increased dramatically, and remained accurate without further training (from [Massaro and Light, 2004a]).
As expected, identification accuracy (mean = .72) was always higher than production accuracy (mean = .64), since a student
could know the name of an item without being able to pronounce it correctly. There was little knowledge of the test items without training, even though these items were repeatedly tested for many days. Once training began on a set of items, however, performance improved fairly quickly until asymptotic knowledge was obtained. This knowledge did not degrade after training on these words ended and training on other words took place. In addition, a reassessment test given about 4 weeks after completion of the experiment revealed that the students retained the items that were learned.

The average number of trials required to reach criterion was 5, 4.3, and 3.4 for mastering the first, second, and third sets of categories, respectively. Given that the word lists were randomized across participants, differences in the difficulty of the word sets were probably not responsible for this difference. Learning vocabulary actually involves three different things to be learned. Stimulus learning involves recognizing the stimulus, response learning requires acquiring the appropriate response, and stimulus-response learning requires an association of the stimulus and the response. The testing of the items actually gave the students experience that could contribute to both stimulus learning and response learning. Thus, when the items were finally trained, the students only had to master the stimulus-response association. This would give an advantage in learning the second and third sets of words relative to the first and second, once training was initiated.
5.3 Improving the Vocabulary of Autistic Children
Autism is a spectrum disorder characterized by a variety of perceptual, cognitive, and social differences. Among the defining characteristics of autism, the limited ability to produce and comprehend spoken language is the most common factor leading to diagnosis [American Psychiatric Association, 1994]. The language and communicative deficits extend across a broad range of expression [Tager-Flusberg, 1999]. Individual variations occur in the degree to which these children develop the fundamental lexical, semantic, syntactic, phonological, and pragmatic components of language, and some children fail to develop one or more of these elements of language comprehension and production. Approximately one-half of the autistic population fails to develop any form of functional language [Tager-Flusberg, 2000]. Within the population that does develop language, the onset and rate at which the children pass through linguistic milestones are often delayed compared to non-autistic children (e.g. no single words by age 2 years, no communicative phrases by age 3) [American Psychiatric Association, 1994]. The ability to label objects is often severely delayed in this population, and the use and knowledge of verbs and adjectives is often deviant. Van Lancker et al. [1991] investigated the abilities of autistic
The Psychology and Technology of Talking Heads
201
and schizophrenic children to identify concrete nouns, non-emotional adjectives, and emotional adjectives. The results showed that the performance of children with autism was below controls in all three areas. Despite the prevalence of language delays in autistic individuals, formalized research has been limited, partly due to the social challenges inherent in this population [Tager-Flusberg, 2000]. Intervention programs for children with autism typically emphasize developing speech and communication skills (e.g. TEEACH, Applied Behavioural Analysis). These programs most often focus on the fundamental lexical, semantic, syntactic, phonological, and pragmatic components of language. The behavioural difficulties speech therapists and instructors encounter, such as lack of cooperation, aggression, and lack of motivation to communicate, create difficult situations that are not optimal for learning. Thus, creating motivational environments necessary to develop these language skills introduces many inherent obstacles [Tager-Flusberg, 2000]. In this study [Bosseler and Massaro, 2003], the Tutors were constructed and run on a 600 MHz PC with 128 MB RAM hard drive running Microsoft Windows NT 4 with a Gforce 256 AGP-V6800 DDR graphics board. The tutorials were presented on a Graphic Series view Sonic 20” monitor. All students wore a Plantronics PC Headset model SR1. Students completed 2 sessions a week, a minimum of 2 lessons per session, and an average of 3, and sometimes as many as 8. The sessions lasted between 10 and 40 minutes. A total of 559 different vocabulary items were selected from the curriculum of both schools for a total of over 84 unique vocabulary lessons. A series of observations by the experimenter during the course of each lesson led to many changes in the program, including the use of headsets, isolating the student from the rest of the class and removal of negative verbal feedback from Baldi (such as, “No, (student) that’s not right”. The students appeared to enjoy working with Baldi. We documented the children saying such things as “Hi Baldi” and “I love you Baldi”. The stickers generated for correct (happy face) and incorrect (sad face) responses proved to be an effective way to provide feedback for the children, although some students displayed frustration when he or she received more than one sad face. The children responded to the happy faces by saying such things like “Look, I got them all right”, or laughing when a happy face appeared. We also observed the students providing verbal praise to themselves such as “Good job”, or prompting the experimenter to say “Good job” after every response. For the autistic children, several hundred vocabulary tutors were constructed, consisting of various vocabulary items selected from the curriculum of two schools. The children were administered the tutorial lessons until 100% accuracy was attained on the post-test exercise. Once 100% accuracy was attained on the final post-test module, the child did not see these lessons again until reassessment approximately 30 days later.
Figure 9.8. The mean observed proportion of correct identifications for the initial assessment, final post-test, and reassessment for each of the seven students. (The original bar chart, titled "Evaluation of Language Wizard/Player", plots the number of vocabulary items that were incorrect, learned, or already known for each condition: assessment, final post-test, and 30 days later.) The results reveal that these seven students were able to accurately identify significantly more words during the reassessment than during the initial assessment (from [Bosseler and Massaro, 2003]).
Figure 9.8 shows that the children learned many new words, grammatical constructions, and concepts, indicating that the language tutors provide a valuable learning environment for these children. In order to assess how well the children would retain the vocabulary items that were learned during the tutorial lessons, we administered the assessment test to each student at least 30 days following the final post-test. As can be seen in Figure 9.8, the students were able to recall 85% of the newly-learned vocabulary items at least 30 days following training.
5.4 Validity and Generalization
Although all of the children demonstrated learning from initial assessment to final reassessment, the children might have been learning the words outside of our program, for example, from speech therapists, at home, or in their school curriculum. Furthermore, we questioned whether the vocabulary knowledge would generalize to new pictorial instances of the words. To address these issues we conducted a second experiment. Collaborating with the children’s instructors and speech therapists, we gathered an assortment of vocabulary words that the children supposedly did not know. We used these words in the Horner
and Baer [1978] single-subject multiple-probe design. We randomly separated the words to be trained into three sets, established individual pre-training performance for each set of vocabulary items, and trained on the first set of words while probing performance for both the trained and untrained sets of words. Once the student was able to attain 100% identification accuracy during a training session, a generalization probe to new instances of the vocabulary images was initiated. If the child did not meet the criterion, he or she was trained on these new images. Generalization training continued until the criterion was met, at which time training began on the next set of words. Probe tests continued on the original learned set of words and images until the end of the study. We continued this procedure until the student completed training on all three sets of words. Our goal was to observe a significant increase in identification accuracy during the post-training sessions relative to the pre-training sessions.

Figure 9.9 displays the proportion of correct responses for a typical student during the probe sessions conducted at pre-training and post-training for each of the three word sets. The vertical lines in each of the three panels indicate the last pre-training session before the onset of training. Some of the words were clearly known prior to training, and some were even learned to a degree without training. As can be seen in the figure, however, training was necessary for substantial learning to occur. In addition, the children were able to generalize accurate identification to four instances of untrained images.

In summary, the goal of these investigations was to evaluate the potential of using a computer-animated talking tutor for children with language delays. The results showed a significant gain in vocabulary. We also found that the students were able to recall much of the new vocabulary when reassessed 30 days after learning. Follow-up research showed that the learning is indeed occurring from the computer program and that vocabulary knowledge can transfer to novel images.

It should be emphasized that the present studies used a within-subject design with fewer participants, rather than the between-subject designs that are more generally accepted in the field. There are several limitations with between-subject designs, however. First, different groups might differ on a pretest, so that pretest-posttest differences might not be a valid dependent measure [Loftus, 1978]. Second, even if subjects do not differ on the pretest, they might differ in their learning ability. Thus group differences might reflect these other differences rather than the learning conditions. These potential limitations are less problematic in the within-subject design used in the current research. Furthermore, the multiple baseline design enforces a within-subject comparison in which the effectiveness of the independent variable can be evaluated directly. In our case, a child was continuously tested on items that had not yet been learned, were currently being learned, or had already been learned.
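As a rough illustration of the structure of this multiple-probe procedure, the following Python sketch simulates training one word set at a time to a 100% criterion while probing all sets after every session. The SimulatedStudent class, the word sets, and the learning rule are illustrative assumptions, not part of the published protocol, and the sketch omits the generalization probes with novel images.

```python
# Minimal sketch of a multiple-probe training loop (all names and data are assumed).
class SimulatedStudent:
    """Toy learner: a word counts as 'known' after a fixed number of training exposures."""
    def __init__(self, trials_to_learn=3):
        self.exposures = {}
        self.trials_to_learn = trials_to_learn

    def knows(self, word):
        return self.exposures.get(word, 0) >= self.trials_to_learn

    def train(self, words):
        for word in words:
            self.exposures[word] = self.exposures.get(word, 0) + 1

    def accuracy(self, words):
        return sum(self.knows(w) for w in words) / len(words)


def multiple_probe_study(student, word_sets):
    """Train one set at a time to a 100% criterion, probing every set after each session."""
    probe_log = []
    for words in word_sets.values():
        while student.accuracy(words) < 1.0:          # train until criterion is met
            student.train(words)
            # Probe both trained and untrained sets after every training session.
            probe_log.append({name: student.accuracy(w) for name, w in word_sets.items()})
    return probe_log


sets = {"set 1": ["dog", "cup"], "set 2": ["shoe", "fork"], "set 3": ["leaf", "bell"]}
for probe in multiple_probe_study(SimulatedStudent(), sets):
    print(probe)
```

Running the sketch shows the pattern the design is meant to reveal: accuracy on a set stays flat until its own training begins, which is what licenses attributing the gains to the training rather than to outside exposure.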
Figure 9.9. Proportion correct during the Pretraining, Posttraining, and Generalization for one of the six students. The vertical lines separate the Pretraining and Posttraining conditions. Generalization results are given by the open squares (from [Bosseler and Massaro, 2003]).
The direct comparisons among these conditions showed conclusively that the training program was responsible for the learning. An obvious question is how effective our training program is relative to a live teacher. Given the fairly fast rate of learning for both the hard of hearing and autistic children, the potential advantage of a live teacher cannot be very large. Even if learning with a live teacher is significantly faster, our program is still valuable because it is low cost and always available.
5.5 Value of the Face for Autistic Children
We believe that the children in our investigation profited from having the face and that seeing and hearing spoken language can better guide language learning than either modality alone. A direct test of this hypothesis would involve comparing learning with and without the face. The purpose of this investigation was to evaluate whether pairing an animated tutor, Baldi, with audible speech in vocabulary training facilitates learning and retention relative to presenting the audible speech alone [Massaro and Bosseler, 2003]. If children with autism do not extract meaningful information from the face, then we would expect to see no difference in learning between the two conditions. This evaluation was carried out using two within-subject experimental conditions: training with the face and the voice, and training with the voice only. We evaluated whether the face would increase the rate of learning for both receptive and verbal production measures. To accomplish our goal we compared the two conditions according to an alternating treatment design [Baer et al., 1968; Horner and Baer, 1978] in which each student received the two learning conditions concurrently, with the order of presentation of the two conditions counterbalanced across days (sketched schematically at the end of this section). This alternating treatment design permitted us to assess individual performance on word identification and production, eliminating inter-subject variability and permitting direct observation of the two treatment conditions [Baer et al., 1968; Horner and Baer, 1978]. To determine the effect of training on retention, an additional testing block was carried out once training was terminated.

Figure 9.10 gives the identification results for one of the five students for the pre-training, training post-test, and post-training blocks. The left and right vertical lines in the figure separate these three conditions. As can be seen in the figure, this student showed a very large advantage of having the face during learning. Across all 5 students, there was faster learning and more retention with the face than without the face.

In summary, the studies using the Language Wizard/Tutor show that children with language challenges were able to learn a significant amount of new vocabulary. By implementing experimental controls in our evaluation, we were able to conclude that the Language Wizard/Tutor was responsible for this learning. Finally, by systematically varying whether Baldi's face was present during the language tutoring, we learned that the presence of the face contributed significantly to the language learning.
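A minimal sketch of the alternating treatment schedule is given below. Only the counterbalancing of condition order across days is taken from the description above; the session structure, function names, and number of days are assumptions for illustration.

```python
# Sketch of an alternating treatment schedule: both conditions are run each day,
# with the order of presentation alternated (counterbalanced) across days.
CONDITIONS = ("face + voice", "voice only")   # the two within-subject conditions

def alternating_schedule(n_days):
    schedule = []
    for day in range(n_days):
        order = CONDITIONS if day % 2 == 0 else CONDITIONS[::-1]
        schedule.append(order)
    return schedule

for day, (first, second) in enumerate(alternating_schedule(6), start=1):
    print(f"Day {day}: {first} first, then {second}")
```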
5.6 Training Speech Production
Baldi can actually provide more information than a natural face. Baldi has a tongue, hard palate and three-dimensional teeth and his internal articulatory movements have been trained with electropalatography and ultrasound data
Figure 9.10. Mean proportion correct receptive responses for one of the 5 students during pretraining (first 3 blocks), training, and post-training (last 3 blocks) as a function of training block for the face and no face conditions.
from natural speech [Cohen et al., 1998]. Baldi can be programmed to display a midsagittal view, or the skin on the face can be made transparent to reveal the internal articulators. The orientation of the face can be changed to display different viewpoints while speaking, such as a side view or a view from the back of the head [Massaro, 1999]. The auditory and visual speech can also be independently controlled and manipulated, permitting customized enhancements of the informative characteristics of speech. These features offer novel approaches in language training, permitting one to pedagogically illustrate appropriate articulations that are usually hidden by the face. More generally, additional research should investigate whether the combined influence of several modalities on language processing provides a productive approach to language learning.

Children with hearing loss require guided instruction in speech perception and production. Some of the distinctions in spoken language cannot be heard with degraded hearing, even when the hearing loss has been compensated by hearing aids or cochlear implants. To overcome this limitation, we use visible speech when providing our stimuli. Based on reading research [Torgesen et al., 1999], we expected that visible cues would allow for heightened awareness of the articulation of these segments and assist in the training process. Although many of the subtle distinctions among segments are not visible on the outside of the face, the skin of our talking head can be made transparent so that the inside of the vocal tract is visible, or we can present a cutaway view of the head along the sagittal plane. As an example, a unique view of Baldi's internal articulators can be presented by rotating the exposed head and vocal tract to be oriented away from the student. It is possible that this back-of-head view would be much more conducive to learning language production. The tongue in this view moves away from and towards the student in the same way as the student's own tongue would move. This correspondence between views
Figure 9.11. The four presentation conditions of Baldi with transparent skin revealing inside articulators (back view, sagittal view, side view, front view).
of the target and the student's articulators might facilitate speech production learning. One analogy is the way one might use a map. We often orient the map in the direction we are headed to make it easier to follow (e.g. turning right on the map is equivalent to turning right in reality).

Another characteristic of the training is to provide additional cues for visible speech perception. Baldi can illustrate the articulatory movements, and he can be made even more informative by embellishing the visible speech with added features. Several alternatives are obvious for distinguishing phonemes that have similar visible articulations, such as the difference between voiced and voiceless segments. For instance, visual indications of vocal cord vibration and turbulent airflow can be used to increase awareness of voiced versus voiceless distinctions. These embellished speech cues could make the face more informative than it normally is.
5.6.1 Empirical test of speech production training. In the Massaro and Light [2004b] study, hard of hearing students were trained to discriminate minimal pairs of words bimodally (auditorily and visually), and were also trained to produce various speech segments using visual information about how the internal oral articulators move during speech production. As shown in Figure 9.11, the articulators were displayed from different vantage points so that the subtleties of articulation could be optimally visualized. The speech was also slowed down significantly to emphasize and elongate the target phonemes, allowing for a clearer understanding of how the target segment is produced in isolation or with other segments.

During production training, different illustrations were used to train different distinctions. Although any given speech sound can be produced in a variety of ways, a prototypical production was always used. Supplementary visual indications of vocal cord vibration and turbulent airflow were used to distinguish
the voiced from the voiceless cognates. The major differences in the production of these sounds are the amount of turbulent airflow and vocal cord vibration that takes place (e.g. voiced segments: vocal cord vibration with minimal turbulent airflow; voiceless segments: no vocal cord vibration with significant turbulent airflow). Although the internal views of the oral cavity were similar for these cognate pairs, they differed on the supplementary voicing features. For consonant clusters, we presented a view of the internal articulators during the production to illustrate the transition from one articulatory position to the next. Finally, both the visible internal articulation and the supplementary voicing features were informative for fricative versus affricate training. An affricate is a stop followed by a homorganic fricative. The two differ in the time course of articulation and in how the air escapes the mouth (e.g. fricative: slow, consistent turbulent airflow; affricate: quick, abrupt turbulent airflow).

The production of speech segments was trained both in isolation and in word contexts. Successful perceptual learning has been reported to depend on the presence of stimulus variability in the training materials [Kirk et al., 1997]. In the present study, to optimize learning, we varied the trained speech segments on various dimensions such as segment environment (beginning/end of word), neighbouring vowel quality (height and front/backness features), and, in the case of consonant cluster training, neighbouring consonant quality (place and manner features). Ideally, training of a target segment would generalize to any word, trained or untrained. In an attempt to assess whether or not the learning of specific segments was restricted to the words involved in our training, we included both trained and untrained words in our pre-test and post-test measures. This contrast allowed us to test whether the training generalized to new words. A follow-up measure allowed us to evaluate retention of training six weeks after the post-test. We expected that performance would be greater than at pre-test but not as high as at post-test, due to the discontinuation of training.

The main goal of this study was to implement Baldi as a language tutor for speech perception and production for hard of hearing individuals. The students' ability to perceive and produce words involving the trained segments improved from pre-test to post-test. A second analysis revealed an improvement in production no matter which training method was used (e.g. vocal cord vibration and turbulent airflow vs. slowed down speech with multiple internal articulatory views vs. a combination of both methods). The present findings suggest that Baldi is an effective tutor for speech training of hard of hearing students.

There are other advantages of Baldi that were not exploited in the present study. Baldi can be accessed at any time, used as frequently as wished, and modified to suit individual needs. Baldi also proved beneficial even though the students in this study were continually receiving speech training from their regular teachers and speech teachers before, during, and after this
Figure 9.12. Supplementary features indicating, from left to right, vocal cord vibration, frication as in /s/, and nasality as in /n/ (the red nasal opening cannot be seen in the black and white illustration).
study took place. Baldi appears to offer unique features that can be added to the arsenal of speech-language pathologists.

The post-test productions were significantly better than the pre-test productions, indicating significant learning. Although it is always possible that some of this learning occurred independently of our program or was simply based on routine practice, there is evidence that at least some of the improvement was due to our program: follow-up ratings six weeks after our training was complete were significantly lower than post-test ratings, indicating some decrement due to lack of continued use. From these results we can conclude that our training program was a significant contributing factor to the change in ratings seen for production ability. Future studies can now focus on which specific training regimens for which contrasts are most effective.
5.6.2 Learning speech in a new language. A recent study by Massaro and Light [2003] demonstrated the effectiveness of Baldi for teaching non-native phonetic contrasts by including instruction illustrating the internal articulatory processes of the oral cavity, as in the previous study with children with hearing loss. Eleven Japanese adult speakers of English as a second language were trained bimodally to identify and produce American English /r/ and /l/. The adults learned to produce these segments more accurately, indicating that Baldi holds great promise for second language learning.
5.7 Learning to Read
We now turn from speech reading to reading: the out-of-the-ordinary problems that a number of children encounter in learning to read and spell. Dyslexia
is a category used to pigeonhole children who have much more difficulty in reading and spelling than would be expected from their other perceptual and cognitive abilities [Fleming, 1984; Willows et al., 1993]. Psychological science has established a tight relationship between the mastery of written language and the child's ability to process spoken language [Morais and Kolinsky, 1994]. That is, it appears that many dyslexic children also have deficits in spoken language perception. The difficulty with spoken language can be alleviated through improving children's perception of phonological distinctions and transitions, which in turn improves their ability to read and spell. Visible speech can only enhance the instruction of phonological awareness [Torgesen et al., 1999] and therefore provides another dimension of information for the children to use in identifying segments and mastering phonological awareness. Baldi can be embellished to signal characteristics of the speech signal that could aid in the teaching of phonological awareness. Figure 9.12 illustrates some potential features that could be displayed along with the typical information given by visible speech. Today, almost all personal computers have the capability to support a bimodal text-to-speech system, which would make it possible to incorporate bimodal speech in reading exercises. This treatment holds great promise, and we believe that adding visual speech will significantly enhance the positive results that have already been demonstrated with audible speech alone.
6. Summary
Speech and language science and technology evolved under the assumption that speech was a solely auditory event. However, a burgeoning record of research findings reveals that our perception and understanding are influenced by a speaker's face and accompanying gestures, as well as the actual sound of the speech. Perceivers expertly use these multiple sources of information to identify and interpret the language input. Given the value of face-to-face interaction, our persistent goal has been to develop, evaluate, and apply animated agents to produce realistic and accurate speech. Baldi is an accurate three-dimensional animated talking head appropriately aligned with either synthesized or natural speech. Baldi has a realistic tongue and palate, which can be displayed by making his skin transparent. Based on this research and technology, we have implemented computer-assisted speech and language tutors for children with language challenges and persons learning a second language. Our language-training program utilizes Baldi as the conversational agent, who guides students through a variety of exercises designed to teach vocabulary and grammar, to improve speech articulation, and to develop linguistic and phonological awareness. Some of the advantages of the Baldi pedagogy and technology include the popularity and
effectiveness of computers and embodied conversational agents, the perpetual availability of the program, and individualized instruction. The science and technology of Baldi hold great promise in language learning, dialog, human-machine interaction, education, and edutainment.
Acknowledgements The research and writing of the chapter were supported by the National Science Foundation (Grant No. CDA-9726363, Grant No. BCS-9905176, Grant No. IIS-0086107), Public Health Service (Grant No. PHS R01 DC00236), a Cure Autism Now Foundation Innovative Technology Award, and the University of California, Santa Cruz. The author is highly grateful for the dedication of the PSL team, particularly Michael Cohen, Alexis Bosseler, Joanna Light, Rashid Clark, and Slim Ouni.
References

American Psychiatric Association (1994). Diagnostic and Statistical Manual of Mental Disorders, DSM-IV. Washington, DC, 4th edition.
Anderson, R. C. and Freebody, P. (1981). Vocabulary Knowledge. In Guthrie, J. T., editor, Comprehension and Teaching: Research Perspectives, pages 71–117. Newark, DE: International Reading Association.
Baer, D. M., Wolf, M. M., and Risley, T. R. (1968). Some Current Dimensions of Applied Behavior Analysis. Journal of Applied Behavior Analysis, 1:91–97.
Barker, L. J. (2003). Computer-Assisted Vocabulary Acquisition: The CSLU Vocabulary Tutor in Oral-Deaf Education. Journal of Deaf Studies and Deaf Education, 8:187–198.
Berninger, V. W. and Richards, T. L. (2002). Brain Literacy for Educators and Psychologists. San Diego: Academic Press.
Bosseler, A. and Massaro, D. W. (2003). Development and Evaluation of a Computer-Animated Tutor for Vocabulary and Language Learning for Children with Autism. Journal of Autism and Developmental Disorders, 33:653–672.
Cohen, M. M., Beskow, J., and Massaro, D. W. (1998). Recent Developments in Facial Animation: An Inside View. In Burnham, D., Robert-Ribes, J., and Vatikiotis-Bateson, E., editors, Proceedings of Auditory-Visual Speech Processing (AVSP), pages 201–206, Sydney, Australia.
Cohen, M. M., Clark, R., and Massaro, D. W. (2001). Animated Speech: Research Progress and Applications. In Massaro, D. W., Light, J., and Geraci, K., editors, Proceedings of Auditory-Visual Speech Processing (AVSP), page 201, Aalborg, Denmark. Santa Cruz, CA: Perceptual Science Laboratory.
Cohen, M. M. and Massaro, D. W. (1993). Modeling Coarticulation in Synthetic Visual Speech. In Thalmann, M. and Thalmann, D., editors, Computer Animation '93, pages 139–156. Tokyo: Springer Verlag.
Cohen, M. M., Massaro, D. W., and Clark, R. (2002). Training a Talking Head. In Proceedings of the Fourth International Conference on Multimodal Interfaces (ICMI), pages 499–510, Pittsburgh, Pennsylvania, USA.
Cohen, M. M., Walker, R. L., and Massaro, D. W. (1996). Perception of Synthetic Visual Speech. In Stork, D. G. and Hennecke, M. E., editors, Speechreading by Humans and Machines, pages 153–168. New York: Springer.
Cosi, P., Cohen, M. M., and Massaro, D. W. (2002). Baldini: Baldi Speaks Italian. In Proceedings of the Seventh International Conference on Spoken Language Processing (ICSLP), pages 2349–2352, Denver, Colorado.
Courchesne, E., Townsend, J., Ashoomoff, N. A., Yeung-Courchesne, R., Press, G., Murakami, J., Lincoln, A., James, H., Saitoh, O., Haas, R., and Schreibman, L. (1994). A New Finding in Autism: Impairment in Shifting Attention. In Broman, S. H. and Grafman, J., editors, Atypical Cognitive Deficits in Developmental Disorders: Implications for Brain Function, pages 101–137. Hillsdale, NJ: Lawrence Erlbaum.
Denes, P. B. and Pinson, E. N. (1963). The Speech Chain: The Physics and Biology of Spoken Language. New York: Bell Telephone Laboratories.
Fleming, E. (1984). Believe the Heart. San Francisco: Strawberry Hill Press.
Guenter, B., Grimm, C., Wood, D., Malvar, H., and Pighin, F. (1998). Making Faces. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 55–66. New York, NY: ACM Press.
Horner, R. D. and Baer, D. M. (1978). Multiple-Probe Technique: A Variation of the Multiple Baseline. Journal of Applied Behavior Analysis, 11:189–196.
Jesse, A., Vrignaud, N., and Massaro, D. W. (2000/2001). The Processing of Information from Multiple Sources in Simultaneous Interpreting. Interpreting, 5:95–115.
Kent, J. R., Carlson, W. E., and Parent, R. E. (1992). Shape Transformation for Polyhedral Objects. In Proceedings of ACM SIGGRAPH Computer Graphics, volume 26:2, pages 47–54. New York, NY: ACM Press.
Kirk, K. I., Pisoni, D. B., and Miyamoto, R. C. (1997). Effects of Stimulus Variability on Speech Perception in Listeners with Hearing Impairment. Journal of Speech & Hearing Research, 40:1395–1405.
Loftus, G. R. (1978). On Interpretation of Interactions. Memory & Cognition, 6:312–319.
Massaro, D. W. (1987). Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry. Hillsdale, NJ: Lawrence Erlbaum.
Massaro, D. W. (1998). Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. Cambridge, Massachusetts: MIT Press.
Massaro, D. W. (1999). From Theory to Practice: Rewards and Challenges. In Proceedings of the International Conference of Phonetic Sciences, pages 1289–1292, San Francisco, CA.
Massaro, D. W. (2000). From "Speech is Special" to Talking Heads in Language Learning. In Proceedings of Integrating Speech Technology in the (Language) Learning and Assistive Interface (InSTIL), pages 153–161, Dundee, Scotland.
Massaro, D. W. (2002). Multimodal Speech Perception: A Paradigm for Speech Science. In Granström, B., House, D., and Karlsson, I., editors, Multimodality in Language and Speech Systems, pages 45–71. Kluwer Academic Publishers, Dordrecht, The Netherlands.
Massaro, D. W. and Bosseler, A. (2003). Perceiving Speech by Ear and Eye: Multimodal Integration by Children with Autism. The Journal of Developmental and Learning Disorders, 7:111–146.
Massaro, D. W., Bosseler, A., and Light, J. (2003). Development and Evaluation of a Computer-Animated Tutor for Language and Vocabulary Learning. In Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, Spain.
Massaro, D. W., Cohen, M. M., Campbell, C. S., and Rodriguez, T. (2001). Bayes Factor of Model Selection Validates FLMP. Psychonomic Bulletin & Review, 8:1–17.
Massaro, D. W., Cohen, M. M., Tabain, M., Beskow, J., and Clark, R. (2005). Animated Speech: Research Progress and Applications. In Vatikiotis-Bateson, E., Bailly, G., and Perrier, P., editors, Audiovisual Speech Processing. Cambridge: MIT Press. In press.
Massaro, D. W. and Light, J. (2004a). Improving the Vocabulary of Children with Hearing Loss. Volta Review, 104(3):141–174.
Massaro, D. W. and Light, J. (2004b). Using Visible Speech for Training Perception and Production of Speech for Hard of Hearing Individuals. Journal of Speech, Language, and Hearing Research, 47(2):304–320.
McKeown, M., Beck, I., Omanson, R., and Pople, M. (1985). Some Effects of the Nature and Frequency of Vocabulary Instruction on the Knowledge and Use of Words. Reading Research Quarterly, 20:522–535.
Morais, J. and Kolinsky, R. (1994). Perception and Awareness in Phonological Processing: The Case of the Phoneme. Cognition, 50:287–297.
Ouni, S., Massaro, D. W., Cohen, M. M., Young, K., and Jesse, A. (2003). Internationalization of a Talking Head. In Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS), Barcelona, Spain.
Parke, F. I. (1975). A Model for Human Faces that allows Speech Synchronized Animation. Computers and Graphics Journal, 1:1–4.
Schopler, E., Mesibov, G. B., and Hearsey, K. (1995). Structured Teaching in the TEACCH System. In Schopler, E., Mesibov, G. B., and Hearsey, K., editors, Learning and Cognition in Autism. Current Issues in Autism, pages 243–268. New York: Plenum Press.
Stahl, S. A. (1986). Three Principles of Effective Vocabulary Instruction. Journal of Reading, 29:662–668.
Tager-Flusberg, H. (1999). A Psychological Approach to Understanding the Social and Language Impairments in Autism. International Review of Psychiatry, 11:325–334.
Tager-Flusberg, H. (2000). Language Development in Children with Autism. In Menn, L. and Ratner, N. B., editors, Methods For Studying Language Production, pages 313–332. Mahwah, NJ: Lawrence Erlbaum.
Torgesen, J. K., Wagner, R. K., Rashotte, C. A., Rose, E., Lindamood, P., Conway, T., and Garvan, C. (1999). Preventing Reading Failure in Young Children with Phonological Processing Disabilities: Group and Individual Responses to Instruction. Journal of Educational Psychology, 91:579–593.
Trychin, S. (1997). Guidelines for Providing Mental Health Services to People who are Hard of Hearing. Technical report, Gallaudet University, Washington D.C.
van Lancker, D., Cornelius, C., and Needleman, R. (1991). Comprehension of Verbal Terms for Emotions in Normal, Autistic, and Schizophrenic Children. Developmental Neuropsychology, 7:1–18.
Waxman, S. R. (2002). Early Word-Learning and Conceptual Development: Everything had a Name, and each Name Gave Birth to a New Thought. In Goswami, U., editor, Handbook of Childhood Cognitive Development, pages 102–126. Malden, MA: Blackwell Publishing.
Willows, D. M., Kruk, R. S., and Corcos, E., editors (1993). Visual Processes in Reading and Reading Disabilities. Hillsdale, NJ: Lawrence Erlbaum.
Wood, J. (2001). Can Software Support Children's Vocabulary Development? Language Learning & Technology, 5:166–201.
Chapter 10

EFFECTIVE INTERACTION WITH TALKING ANIMATED AGENTS IN DIALOGUE SYSTEMS

Björn Granström and David House
Centre for Speech Technology (CTT), Department of Speech, Music and Hearing, KTH, Stockholm, Sweden
{bjorn, davidh}@speech.kth.se

Abstract
At the Centre for Speech Technology at KTH, we have for the past several years been developing spoken dialogue applications that include animated talking agents. Our motivation for moving into audiovisual output is to investigate the advantages of multimodality in human-system communication. While the mainstream character animation area has focussed on the naturalness and realism of the animated agents, our primary concern has been the possible increase of intelligibility and efficiency of interaction resulting from the addition of a talking face. In our first dialogue system, Waxholm, the agent used the deictic function of indicating specific information on the screen by eye gaze. In another project, Synface, we were specifically concerned with the advantages in intelligibility that a talking face could provide. In recent studies we have investigated the use of facial gesture cues to convey such dialogue-related functions as feedback and turn-taking as well as prosodic functions such as prominence. Results show that cues such as eyebrow and head movement can independently signal prominence. Current results also indicate that there can be considerable differences in cue strengths among visual cues such as smiling and nodding and that such cues can contribute in an additive manner together with auditory prosody as cues to different dialogue functions. Results from some of these studies are presented in the chapter along with examples of spoken dialogue applications using talking heads.
Keywords: Audio-visual speech synthesis, talking heads, animated agents, spoken dialogue systems, visual prosody.
1. Introduction
As we contribute to advances in spoken dialogue systems and see them being integrated into commercial products, we are witnessing a transformation
of the user interface, a transition from the desktop metaphor to the person metaphor. As we attempt to take advantage of the effective communication potential of human conversation, we see an increasing need to embody the conversational partner. A talking animated agent can provide the user with an interactive partner whose goal is to take the role of the human agent. An effective agent is one who is capable of supplying the user with relevant information, can fluently answer questions concerning complex information, and can ultimately assist the user in a decision-making process through the interactive flow of conversation. One way to achieve believability is through the use of a talking head where information is transformed through text into speech, articulator movements, speech-related gestures and conversational gestures. Many other useful applications of talking heads include aids for the hearing impaired, educational software, audiovisual human perception experiments [Massaro, 1998], entertainment, and high-quality audio-visual text-to-speech synthesis for applications such as news reading.

In this chapter we will focus on two main aspects of effective interaction in dialogue systems: presentation of information and the flow of interactive dialogue. Effectiveness in the presentation of information is crucial to the success of an interactive system. Information must be presented rapidly, succinctly and with high intelligibility. The use of the talking head aims at improving the intelligibility of speech synthesis through visual articulation and by providing the system with a visible location of the speech source to maintain the attention of the user. Important information can be selectively highlighted by prosodic enhancement and by the use of the agent's gaze and visual prosody to create and maintain a common focus of attention. The use of the talking head also aims at increasing effectiveness by building on the user's social skills to improve the flow of the dialogue. Visual cues to feedback, turn-taking and signalling the system's internal state (the thinking metaphor) are key aspects of effective interaction.

The talking head developed at KTH is based on text-to-speech synthesis. Audio speech synthesis is generated from a text representation in synchrony with visual articulator movements of the lips, tongue and jaw. Linguistic information in the text is used to generate visual cues for relevant prosodic categories such as prominence, phrasing and emphasis. These cues generally take the form of eyebrow and head movements, which we have termed "visual prosody". These types of visual cues, with the addition of e.g. a smiling or frowning face, are also used as conversational gestures to signal such things as positive or negative feedback, turn-taking regulation, and the system's internal state. In addition, the head can visually signal attitudes and emotions.

This chapter presents a brief overview and technical description of the KTH talking head, explaining what the head can do and how. The two issues of intelligibility and communication interaction are discussed and exemplified by
Figure 10.1. Some different versions of the KTH talking head.
results from intelligibility tests and perceptual evaluation experiments. Brief examples of some experimental applications in which the head is used are then described, and finally, ongoing work on the expressive agent including attitudes and emotions is discussed.
2. The KTH Talking Head
Animated synthetic talking faces and characters have been developed using a number of different techniques and for a variety of purposes during the past two decades. Our approach is based on parameterised, deformable 3D facial models, controlled by rules within a text-to-speech framework [Carlson and Granström, 1997]. The rules generate the parameter tracks for the face from a representation of the text, taking coarticulation into account [Beskow, 1995]. We employ a generalised parameterisation technique to adapt a static 3D-wireframe of a face for visual speech animation [Beskow, 1997]. Based on concepts first introduced by Parke [1982], we define a set of parameters that will deform the wireframe by applying weighted transformations to its vertices. One critical difference from Parke’s system, however, is that we have de-coupled the model definitions from the animation engine, thereby greatly increasing flexibility, allowing models with different topologies to be parameterised and animated, and their parameterisation data to be stored together with the model geometry. The models are made up of polygon surfaces that are rendered in 3D using standard computer graphics techniques. The surfaces can be articulated and deformed under the control of a number of parameters. The parameters are designed to allow for intuitive interactive or rule-based control. For the purposes of animation, parameters can be roughly divided into two (overlapping) categories: those controlling speech articulation and those used for non-articulatory cues and emotions. The articulatory parameters include jaw rotation, lip rounding, bilabial occlusion, labiodental occlusion and tongue tip
elevation. The non-articulatory category includes eyebrow raising, eyebrow shape, smile, gaze direction and head orientation. Furthermore, some of the articulatory parameters such as jaw rotation can be useful in signalling nonverbal elements such as certain emotions.

The display can be chosen to show only the surfaces or the polygons for the different components of the face. The surfaces can be made (semi) transparent to display the internal parts of the model. The model presently contains a relatively crude tongue model primarily intended to provide realism as seen from the outside, through the mouth opening. A full 3D model of the internal speech organs is presently being developed for integration in the talking head [Engwall, 2001]. This capability of the model is especially useful in explaining non-visible articulations in the language learning situation [Cole et al., 1999; Massaro et al., 2003; Massaro and Light, 2003]. Several face models have been developed for different applications, some of which can be seen in Figure 10.1. All can be parametrically controlled by the same articulation rules.

For stimuli preparation and explorative investigations, we have developed a control interface that allows fine-grained control over the trajectories for acoustic as well as visual parameters. The interface is implemented as an extension to the WaveSurfer application (http://www.speech.kth.se/wavesurfer and [Sjölander and Beskow, 2000]), which is a tool for recording, playing, editing, viewing, printing, and labelling audio data. The interface makes it possible to start with an utterance synthesised from text, with the articulatory parameters generated by rule, and then interactively edit the parameter tracks for F0 and the visual (non-articulatory) parameters, as well as the durations of individual segments in the utterance, to produce specific cues. An example of the user interface is shown in Figure 10.2. In the top box a text can be entered in Swedish or English. The selection of language triggers one of two separate text-to-speech systems with different phoneme definitions and rules, built in the Rulsys notation [Carlson and Granström, 1997]. The generated phonetic transcription can be edited. On pushing "Synthesize", rule-generated parameters will be created and displayed in different panes below. The selection of parameters is user controlled. The lower section contains the segmentation and the acoustic waveform. A talking face is displayed in a separate window.

The acoustic synthesis can be exchanged for a natural utterance and synchronised to the face synthesis on a segment-by-segment basis by running the face synthesis with phoneme durations from the natural utterance. This requires a segmentation of the natural utterance, which can be done (semi)automatically in e.g. WaveSurfer. The combination of natural and synthetic speech is useful for different experiments on multimodal integration and has been used in the Synface/Teleface project (see below). In language learning applications it could be used to add to the naturalness of the tutor's voice in cases when the acoustic synthesis is judged to be inappropriate.
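To make the parameterisation scheme sketched at the start of this section concrete (a set of parameters that deform a static wireframe by applying weighted transformations to its vertices), here is a simplified numpy illustration. The vertex coordinates, weight maps, and parameter names are invented for the example; the actual KTH model stores per-model parameterisation data alongside the geometry and uses considerably richer transformations than this linear blend.

```python
import numpy as np

# Toy wireframe: three vertices in 3D (coordinates invented for illustration).
vertices = np.array([[0.0, -1.0, 0.5],    # jaw area
                     [0.2, -0.3, 0.9],    # lower lip
                     [0.2,  0.3, 0.9]])   # upper lip

# Each parameter has a displacement direction and per-vertex weights that say
# how strongly each vertex follows that parameter (values assumed).
parameters = {
    "jaw_rotation": {"direction": np.array([0.0, -1.0, 0.0]),
                     "weights":   np.array([1.0, 0.6, 0.0])},
    "lip_rounding": {"direction": np.array([0.0, 0.0, -0.4]),
                     "weights":   np.array([0.0, 0.8, 0.8])},
}

def deform(vertices, parameters, values):
    """Apply weighted parameter displacements to the neutral wireframe."""
    out = vertices.copy()
    for name, value in values.items():
        p = parameters[name]
        out += value * p["weights"][:, None] * p["direction"][None, :]
    return out

# One animation frame: parameter values as they might be produced by the rules.
frame = deform(vertices, parameters, {"jaw_rotation": 0.3, "lip_rounding": 0.5})
print(frame)
```

Decoupling the parameter definitions (the weight maps) from the deformation routine mirrors the design choice described above: the same animation engine can drive any face model that ships with its own parameterisation data.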
Figure 10.2. The WaveSurfer user interface for parametric manipulation of the multimodal synthesis.
The parametric manipulation tool is used to experiment with and define gestures. Using this tool we have constructed a library of gestures that can be invoked via XML markup in the output text.
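As an illustration of how such a gesture library might be driven from marked-up output text, the fragment below parses a hypothetical XML-tagged utterance and maps tags to gesture names. The tag names, attributes, and gesture identifiers are invented for this sketch and do not reproduce the actual KTH markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up output text; tag and attribute names are invented.
utterance = '<utterance>Yes, <gesture name="nod"/>that is <emph>exactly</emph> right.</utterance>'

root = ET.fromstring(utterance)

def walk(element):
    """Flatten the markup into (event, payload) pairs for an animation engine."""
    events = []
    if element.text:
        events.append(("speak", element.text))
    for child in element:
        if child.tag == "gesture":
            events.append(("gesture", child.get("name")))
        elif child.tag == "emph":
            events.append(("gesture", "eyebrow_raise"))   # assumed mapping for emphasis
            if child.text:
                events.append(("speak", child.text))
        if child.tail:
            events.append(("speak", child.tail))
    return events

for event in walk(root):
    print(event)
```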
3. Effectiveness in Intelligibility and Information Presentation
One of the more striking examples of improvement and effectiveness in speech intelligibility comes from the Synface project, which aims at improving telephone communication for the hearing impaired [Agelfors et al., 1998]. The results of a series of tests using VCV words and hearing impaired subjects showed a significant gain in intelligibility when the talking head was added to a natural voice. With the synthetic face, consonant identification improved from 29% to 54% correct responses. This compares to the 57% correct responses obtained using the natural face. In certain cases, notably the consonants articulated with the lips (i.e. the bilabial and labiodental consonants), the results were in fact better for the synthetic face than for the natural face. This points to the possibility of using overarticulation strategies for the talking face in these kinds of applications. Recent results indicate that a certain degree of overarticulation can be advantageous in improving intelligibility [Beskow et al., 2002].
Table 10.1. Parameters used for articulatory control of the face. The second column indicates which ones are adjusted in the experiments described here.

Parameter                Adjusted in experiment
Jaw rotation             x
Labiodental occlusion
Bilabial occlusion
Lip rounding             x
Lip protrusion           x
Mouth spread             x
Tongue tip elevation
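The articulation strength manipulation examined in these experiments, described in the text below as a global scaling of the amplitudes of the adjusted articulatory parameters, can be sketched as follows. The parameter track, its neutral value, and the units are invented for the example.

```python
import numpy as np

def scale_articulation(track, strength, neutral=0.0):
    """Globally scale the amplitude of an articulatory parameter track.

    strength = 1.0 reproduces the rule-generated default; 0.75 and 1.25
    correspond to the 75% and 125% conditions in the experiment.
    """
    return neutral + strength * (np.asarray(track) - neutral)

# Invented rule-generated jaw-rotation track (arbitrary units, one value per frame).
jaw_track = np.array([0.0, 0.2, 0.5, 0.4, 0.1, 0.0])

for strength in (0.75, 1.0, 1.25):
    print(strength, scale_articulation(jaw_track, strength))
```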
Similar intelligibility tests have been run using normal hearing subjects where the audio signal was degraded by adding white noise [Agelfors et al., 1998], and similar results were obtained. For example, for a synthetic male voice, consonant identification improved from 31% without the face to 45% with the face. While the visual articulation is most probably the key factor contributing to this increase, we can speculate that the presence of visual information about the speech source can also contribute to increased intelligibility by sharpening the subjects' focus of attention.

Hearing impaired persons often subjectively report that some speakers are much easier to speech-read than others. It is reasonable to hypothesise that this variation depends on a large number of factors, such as rate of speaking, amplitude and dynamics of the articulatory movements, orofacial anatomy of the speaker, presence of facial hair and so on. Using traditional techniques, it is however difficult to isolate these factors to get a quantitative measure of their relative contribution to readability. In an attempt to address this issue, we employ a synthetic talking head that allows us to generate stimuli where each variable can be studied in isolation. In this section we focus on a factor that we will refer to as articulation strength, which is implemented as a global scaling of the amplitude of the articulatory movements.

In one experiment the articulation strength has been adjusted by applying a global scaling factor to the parameters marked with an x in Table 10.1. They can all be varied between 25 and 200% of normal. Normal is defined as the default articulation produced by the rules, which are hand-tuned to match a target person's articulation. The default parameters are chosen to optimise the perceived similarity between a target speaker and the synthetic faces. However, it is difficult to know whether optimal intelligibility is achieved in a lip-reading situation for hearing-impaired people. An informal experiment was conducted to find out the average preferred articulation strength and its variance. 24 subjects, all closely connected
to the field of aural rehabilitation, either professionally or as hearing impaired, were asked to choose the most intelligible face out of 8 recordings of the Swedish sentence "De skrattade mycket högt" (They laughed very loud). The recordings had 25, 50, 75, 100, 112, 125, 150 and 175 percent of the default strength of articulation; the amount of co-articulation was not altered. The average preferred hyper-articulation was found to be 24%, given the task to optimise the subjective ability to lip-read. The highest and lowest preferred values were 150% and 90% respectively, with a standard deviation of 16%. The option of setting the articulation strength to the user's subjective preference should be included in the Synface application.

Experiment 1: Audiovisual Consonant Identification

To test the possible quantitative impact of articulation strength, as defined in the previous section, we performed a VCV (vowel-consonant-vowel) test. Three different articulation strengths were used, i.e. 75%, 100% and 125% of the (arbitrary) default articulation strength. Stimuli consisted of nonsense words in the form of VCV combinations. 17 consonants were used: /p, b, m, f, v, t, d, n, s, l, r, k, g, ng, sj, tj, j/ in two symmetric vowel contexts /a, U/, yielding a total of 34 different VCV words. The task was to identify the consonant. (The consonants are given in Swedish orthography; the non-obvious IPA correspondences are ng = /ŋ/, sj = /ɧ/, tj = /ç/.) Each word was presented with each of the three levels of articulation strength. The list was randomised. To avoid starting and ending effects, five extra items were inserted at the beginning and two at the end.

Stimuli were presented audiovisually by the synthetic talking head. The audio was taken from the test material of the Teleface project recordings of natural speech from a male speaker [Agelfors et al., 1998]. The audio had previously been segmented and labelled, allowing us to generate control parameter tracks for facial animation using the visual synthesis rules. The nonsense words were presented in white masking noise at a signal-to-noise ratio of 3 dB. 24 subjects participated in the experiment. They were all undergraduate students at KTH. The experiments were run in plenary sessions by presenting the animations on a large screen using an overhead projector. The subjects responded on pre-printed answer sheets.

The mean results for the different conditions can be seen in Table 10.2. For the /a/ context there are only minor differences in the identification rate, while the results for the /U/ context are generally worse, especially for the 75% articulation strength condition. A plausible reason for this difference lies in the better visibility of tongue articulations in the more open /a/ vowel context than in the rounded /U/ context. It can also be speculated that
Table 10.2. Percent correct consonant identification in the VCV test with respect to place of articulation, presented according to vowel context and articulation strength.

Articulation strength    /aCa/    /UCU/
75%                      78.7     50.5
100%                     75.2     62.2
125%                     80.9     58.1
Table 10.3. Judged naturalness compared to the default (100%) articulation strength.

Articulation strength    Less natural    Equal    More natural
75%                      31.67           23.33    45.00
125%                     41.67           19.17    39.17
movements observed on the outside of the face can add to the superior readability of consonants in the /a/ context. However, we could not find evidence for this in a study based on simultaneous recordings of face and tongue movements [Beskow et al., 2002; Engwall and Beskow, 2003].

Experiment 2: Rating of Naturalness

Eighteen sentences from the Teleface project [Agelfors et al., 1998] were used for a small preference test. Each sentence was played twice; once with standard articulation (100%) and once with smaller (75%) or greater (125%) articulation strength. The set of subjects, presentation method and noise masking of the audio were the same as in Experiment 1 (the VCV test). The subjects were asked to report which of the two variants seemed more natural, or whether they were judged to be of equal quality. The test consisted of 15 stimuli pairs, presented to 24 subjects. To avoid starting and ending effects, two extra pairs were inserted at the beginning and one at the end. The results can be seen in Table 10.3. The only significant preference was for the 75% version, contrary to the initial informal experiment. However, the criterion in the initial test was readability rather than naturalness.

The multimodal synthesis software together with a control interface based on the WaveSurfer platform [Sjölander and Beskow, 2000] allows for the easy production of material addressing the articulation strength issue. There is a possible conflict in producing the most natural and the most easily lip-read face. However, under some conditions it might be favourable to trade off some naturalness for better readability. For example, the dental viseme cluster [r, n, t, d, s, and l] could possibly gain discriminability in connection with closed vowels if tongue movements could be to some extent hyper-articulated and
well rendered. Of course the closed vowels should be as open as possible without jeopardising the overall vowel discriminability. The optimum trade-off between readability and naturalness is certainly also a personal characteristic. It seems likely that hearing impaired people would emphasize readability over naturalness. Therefore it could be considered that in certain applications, such as the Synface software, the user could set the articulation strength to his or her own preference.

Another quite different example of the contribution of the talking head to information presentation is taken from the results of perception studies in which the percept of emphasis and syllable prominence is enhanced by eyebrow and head movements. In an early study restricted to eyebrows and prominence [Granström et al., 1999] it was shown that raising the eyebrows alone during a particular syllable resulted in an increase in prominence judgments for the word in question by nearly 30%. In a later study, it was shown that eyebrows and head movement can serve as independent visual cues for prominence, and that synchronization of the visual movement with the audio speech syllable is an important factor [House et al., 2001]. Head movement was shown to be somewhat more salient for signalling prominence, since eyebrow movement could potentially be misinterpreted as supplying non-linguistic information such as irony. In a study by Massaro and Beskow reported in [Massaro, 2002], where combinations of visual and auditory cues for prominence were tested, eyebrow movement was shown to be a strong correlate to the perception of prominence.

A third example of information enhancement by the visual modality is to be found in the Waxholm demonstrator and the AdApt system. In both these systems, the agent uses gaze to point to areas and objects on the screen, thereby strengthening the common focus of attention between the agent and the user. Although this type of information enhancement has not yet been formally evaluated in the context of these systems, it must be seen as an important potential for improving the effectiveness of interaction.

Finally, an important example of the addition of information through the visual modality is to be found in the August system. This involved adding mood, emotion and attitude to the agent. To enable display of the agent's different moods, six basic emotions similar to the six universal emotions defined by Ekman [1979] were designed (Figure 10.12), inspired by the description in [Pelachaud et al., 1996]. Using these principles, a number of utterances in the system were hand-crafted to display appropriate emotional gestures.
4. Effectiveness in Interaction
The use of a believable talking head can trigger the user's social skills, such as using greetings, addressing the agent by name, and generally socially chatting
with the agent. This was clearly shown by the results of the public use of the August system during a period of six months [Bell and Gustafson, 1999]. These promising results have led to more specific studies on visual cues for feedback [Granström et al., 2002a], in which smile, for example, was found to be the strongest cue for affirmative feedback. Further detailed work on turn-taking regulation, feedback seeking and giving, and the signalling of the system's internal state will enable us to improve the gesture library available for the animated talking head and continue to improve the effectiveness of multimodal dialogue systems.

One of the central claims in many theories of conversation is that dialogue partners seek and provide evidence about the success of their interaction [Clark and Schaeffer, 1989; Traum, 1994; Brennan, 1990]. That is, partners tend to follow a proof procedure to check whether their utterances were understood correctly or not, and constantly exchange specific forms of feedback that can be affirmative (‘go on’) or negative (‘do not go on’). Previous research has brought to light that conversation partners can monitor the dialogue this way on the basis of at least two kinds of features not encoded in the lexico-syntactic structure of a sentence: prosodic and visual features. First, utterances that function as negative signals appear to differ prosodically from affirmative ones in that they are produced with more ‘marked’ settings (e.g. higher, louder, slower) [Shimojima et al., 2002; Krahmer et al., 2002b]. Second, other studies reveal that, in face-to-face interactions, people signal by means of facial expressions and specific body gestures whether or not an utterance was correctly understood [Gill et al., 1999]. Given that current spoken dialogue systems are prone to error, mainly because of problems in the automatic speech recognition (ASR) engine of these systems, a sophisticated use of feedback cues from the system to the user is potentially very helpful for improving human-machine interactions as well, e.g. [Hirschberg et al., 2001].

There are currently a number of advanced multimodal user interfaces in the form of talking heads that can generate audiovisual speech along with different facial expressions [Beskow, 1995; Beskow et al., 2000; Beskow et al., 2001; Granström et al., 2001]. However, while such interfaces can be accurately modified in terms of a number of prosodic and visual parameters, there are as yet no formal models that make explicit how exactly these need to be manipulated to synthesize convincing affirmative and negative cues. One interesting question, for instance, is what the strength relation is between the potential prosodic and visual cues. The interaction between acoustic intonational gestures (F0) and eyebrow movements has been studied in production, e.g. [Cavé et al., 1996], and in perception [Massaro, 2002]. A preliminary hypothesis is that a direct coupling is very unnatural, but that prominence and eyebrow movement may co-occur. In the experiment investigating the contribution of eyebrow movement to the perception of prominence in Swedish
referred to above [Granström et al., 1999], words and syllables with concomitant eyebrow movement were perceived as more prominent than syllables without the movement. In addition, other research on multimodal cues for prominence [House et al., 2001; Krahmer et al., 2002a] has shown that there may be subtle interactions between visual and prosodic modalities in subjects’ perception of spoken stimuli, so it may also be the case that prosodic and visual cues interact when used for backchannelling; see also [Massaro et al., 1996].

The goal of the research presented in this section is to gain more insight into the relative importance of specific prosodic and visual parameters for giving feedback on the success of the interaction. In the research presented below, use is made of a talking head whose prosodic and visual features are orthogonally varied in order to create stimuli that subjects have to judge as affirmative or negative backchannelling signals.
4.1 Stimuli
The stimuli consisted of an exchange between a human, who was intended to represent a client, and the face, representing a travel agent. An observer of these stimuli could only hear the client’s voice, but could both see and hear the face. The human utterance was a natural speech recording and was exactly the same in all exchanges, whereas the speech and the facial expressions of the travel agent were synthetic and variable. The fragment that was manipulated always consisted of the following two utterances:

Human: “Jag vill åka från Stockholm till Linköping.” (“I want to go from Stockholm to Linköping.”)
Head: “Linköping.”
The stimuli were created by orthogonally varying 6 parameters (4 visual and 2 prosodic ones), using two possible settings for each parameter: one which was hypothesised to lead to affirmative feedback responses, and one which was hypothesised to lead to negative responses. For all stimuli, the head was given a neutral face during the time that the human was talking, with three eyeblinks at randomly chosen but natural intervals. The facial expressions changed during the head’s response utterance, through modifications of the following parameters shown in Table 10.4. The parameter settings were largely created by intuition and observing human productions. The smile was a gesture throughout the whole utterance, largely encompassing a widening of the mouth and a slight upwards movement of the mouth corners. The head movement for the affirmative setting was a short nod (300 ms) starting at the first vowel. The negative setting comprised a rise of the head throughout the whole utterance. The eyebrow rise for the
Table 10.4. Different parameters and parameter settings used to create different stimuli.

Parameter      Affirmative setting       Negative setting
Smile          Head smiles               Neutral expression
Head move      Head nods                 Head leans back
Eyebrows       Eyebrows rise             Eyebrows frown
Eye closure    Eyes narrow slightly      Eyes open widely
F0 contour     Declarative               Interrogative
Delay          Immediate reply           Delayed reply

Figure 10.3. F0 contours of the test word Linköping.
affirmative setting was initiated at the start of the utterance, being at its maximum from the start of the second syllable to the end. The eyebrow frown for the negative setting was an immediate frown from the beginning of the utterance which extended throughout the utterance. The affirmative gesture for the eye closure was a short (250 ms) narrowing of the eyes starting in the middle of the first vowel. The negative gesture was a widening of the eyes during the entire utterance. The affirmative and negative F0 contours were based on two natural utterances (see Figure 10.3). The delay for the negative setting was one second longer (1150 ms) compared to the essentially immediate response (150 ms) for the affirmative setting. All combinations of the two settings for the 6 parameters led to a total of 64 stimuli, which were presented to listeners in a perception experiment (see below). In principle, we could have included at least two additional prosodic parameters in our test, tempo and loudness, since these have also been shown to signal affirmative and negative feedback. However, apart from the fact that this would have increased the number of stimuli considerably so that it would be difficult to present all of them in a single experiment, we decided not to take these into account because temporal modifications did not easily fit in
Figure 10.4. The all-negative and all-affirmative faces sampled at the end of the first syllable of Linköping.
our orthogonal design, since just changing the tempo would basically have affected the speed of change in all other visual and prosodic parameters as well. Loudness was excluded since it was uncertain if loudness effects would be perceptible in a group experiment. Samples of the resulting stimuli are seen in Figure 10.4.
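To make the orthogonal design concrete, the following sketch enumerates the 64 stimuli by crossing the two settings of the six parameters listed in Table 10.4. It is an illustrative reconstruction in Python, not the software actually used to build the stimuli; the parameter names and setting labels simply mirror the table.

```python
from itertools import product

# The six parameters and their two settings (cf. Table 10.4).
PARAMETERS = {
    "smile":       ("head smiles", "neutral expression"),
    "head_move":   ("head nods", "head leans back"),
    "eyebrows":    ("eyebrows rise", "eyebrows frown"),
    "eye_closure": ("eyes narrow slightly", "eyes open widely"),
    "f0_contour":  ("declarative", "interrogative"),
    "delay":       ("immediate reply (150 ms)", "delayed reply (1150 ms)"),
}

def generate_stimuli():
    """Return all 2**6 = 64 combinations of affirmative/negative settings."""
    names = list(PARAMETERS)
    stimuli = []
    for settings in product((0, 1), repeat=len(names)):  # 0 = affirmative, 1 = negative
        stimulus = {name: PARAMETERS[name][flag] for name, flag in zip(names, settings)}
        stimulus["n_affirmative"] = settings.count(0)
        stimuli.append(stimulus)
    return stimuli

if __name__ == "__main__":
    stimuli = generate_stimuli()
    print(len(stimuli))   # 64
    print(stimuli[0])     # the all-affirmative stimulus
```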
4.2 Testing
The actual testing was done via a group experiment using a projected image on a large screen. Listeners were told that they were going to see and hear a series of exchanges between a talking head, representing a travel agent, and a human who wants to make a booking with the agent (see the example above). They had to imagine that they were standing beside the human, witnessing a fragment of a larger dialogue exchange. Subjects were told that they could both see and hear the talking head, but only hear the human, and they were informed that the visual expression of the head and the pronunciation of ‘Linköping’ by the head varied, whereas the human utterance was the same in all conditions. Their task was to judge whether the head signals that it understands and accepts the human utterance, or rather signals that it is uncertain about it. In addition, they needed to express on a 5-point scale how confident they were about their response. They were asked to always give an answer, even if they did not have an intuition as to what the head was signalling.

No feedback was given on the ‘correctness’ of the responses; the stimuli were presented in a randomized order. Each stimulus was presented only once. The silent interval between two consecutive stimuli was 4.5 sec. The interval between the onsets of consecutive stimuli was either about 7 or 8 seconds, depending on the delay parameter.
Both the first three and the final two utterances were dummies, which were excluded from the analyses afterwards, to make sure that the stimuli were not biased by unwanted ‘list’ effects. All subjects were volunteers, recruited from KTH personnel. They were not paid for their contribution, but were given coffee and cake after the experiment. After excluding the responses from three subjects who made some unrecoverable errors on their answer sheets, the responses from 17 subjects could be retained for further analyses.
4.3 Results
As can be seen in Figure 10.5, subjects used the confidence ratings in a nonrandom fashion. The numbers plotted in the figure are mean confidence ratings versus percent majority responses for each individual stimulus, i.e. 100% majority response means that all (17) subjects voted for an affirmative interpretation or all voted for the negative interpretation. The correlation coefficient is 0.63, and thus the confidence rating is used in the data analysis below.
Figure 10.5. Mean confidence rating for the different stimuli and regression line (r=0.63).
There is a tendency for the stimuli to be judged as being more affirmative than negative, with four different stimuli receiving unanimously positive responses but with no stimuli receiving unanimously negative responses. Six of the subjects gave more than two-thirds negative responses, while one subject gave only affirmative responses. All subjects, however, used the full confidence scale from 1 to 5, and all results were thus retained in the analysis below.
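For readers who want to reproduce the kind of numbers plotted in Figure 10.5, the hedged sketch below shows one way to compute, per stimulus, the percentage of subjects voting with the majority and the mean confidence rating, and to correlate the two. The data layout and the fabricated example responses are assumptions for illustration only.

```python
from statistics import mean

def majority_and_confidence(responses):
    """responses: list of (judgement, confidence) pairs for one stimulus,
    where judgement is 'affirmative' or 'negative' and confidence is 1..5."""
    n = len(responses)
    n_affirmative = sum(1 for judgement, _ in responses if judgement == "affirmative")
    majority_pct = 100.0 * max(n_affirmative, n - n_affirmative) / n
    return majority_pct, mean(conf for _, conf in responses)

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

# Fabricated responses for three stimuli, three subjects each.
stimuli_responses = [
    [("affirmative", 5), ("affirmative", 4), ("negative", 2)],
    [("negative", 3), ("negative", 4), ("negative", 5)],
    [("affirmative", 3), ("negative", 2), ("affirmative", 4)],
]
points = [majority_and_confidence(r) for r in stimuli_responses]
print(points)
print(pearson_r([p[0] for p in points], [p[1] for p in points]))
```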
Table 10.5. Mean value for affirmative and negative settings of different parameters, mean difference value and corresponding F-statistics.

Parameter      Affirmative   Negative   Diff. value   F(1,62)   p       η²
Smile          2.19          -0.33      2.52          61.18     <.001   .50
F0 contour     1.72           0.14      1.58          15.07     <.001   .20
Eyebrow        1.57           0.29      1.28           9.06     <.005   .13
Head move      1.39           0.47      0.92           4.33     <.05    .07
Eye closure    1.23           0.64      0.59           1.74     n.s.    -
Delay          1.02           0.84      0.18           <1       n.s.    -
The analyses presented are based on numbers that combine the different scores of the subjects, i.e. their yes/no responses and the confidence rating, in the following way: the responses to a stimulus as a negative or an affirmative cue were first recoded as -1 or 1, respectively, and then multiplied by the confidence rating to obtain a score on a scale between -5 (very negative) and +5 (very affirmative). These latter numbers were analysed statistically via repeated-measures ANOVAs run on each of the six parameters of our experimental design.

Table 10.5 gives the mean values for each affirmative and negative setting, the value difference between the affirmative and negative settings, and the corresponding F-statistics. The table shows that 4 of the 6 parameters (Smile, F0 contour, Eyebrow and Head movement) have a significant effect on subjects’ responses, with affirmative settings leading to higher, positive values than the negative settings. The effects of Eye closure and Delay are not significant, but the trends observed in the means are clearly in the expected direction. There appears to be a strength order, with Smile being the most important factor, followed by F0 contour, Eyebrow, Head movement, Eye closure and Delay. Figure 10.6 shows the mean response value difference (from Table 10.5) for stimuli with the indicated cues set to their hypothesised affirmative setting and their negative setting. The combined effect of cues is visualized in Figure 10.7. From left to right, the figure shows a monotone increase in affirmative judgments from stimuli that have only negative settings to stimuli that have only affirmative settings. Also in this case a bias towards affirmative responses can be observed. It is obviously not one single factor which has a predominant effect on subjects’ responses; rather, subjects attend to combinations of features.

Our research has shown that subjects are sensitive to both acoustic and visual parameters when they have to judge utterances as affirmative or negative feedback signals. The differences between cue strengths can be of interest
when implementing feedback signals in animated agents. It is noteworthy that the smile cue (a visual cue) contributed the most to the perception of affirmative feedback. Of all the visual cues used in the experiment, the smile is the one least likely to be associated with a prosodic function other than feedback, such as prominence. The other cues, especially brow raising and nodding, can potentially be associated with a prominence function as well as signalling feedback [House et al., 2001]. The fact that the brow frown functions as a negative cue is not surprising, as the frown can signal confusion or disconcertment. Brow rise as an affirmative cue is more surprising, in that a question or surprise can be accompanied by raised eyebrows. In this experiment, however, the brow rise was quite subtle; a larger raising movement is likely to be interpreted as surprise. The fact that F0 was the second strongest cue demonstrates the importance of acoustic parameters for feedback in the multimodal environment. The relative importance of F0 may also have been enhanced by the shortness of the utterance. One obvious next step is to test whether the fluency of human-machine interactions is helped by the inclusion of such feedback cues in the dialogue management component of a system.

Figure 10.6. The mean response value difference for stimuli with the indicated cues set to their affirmative and negative value (bar chart; y-axis: average response value; cues from left to right: Smile, F0 contour, Eyebrow, Head movement, Eye closure, Delay).

Figure 10.7. The average response value for stimuli with different numbers of affirmative cues (x-axis: number of affirmative cues, 0 to 6; y-axis: average response value).
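The response scoring described in Section 4.3 above (judgements recoded as -1/+1 and multiplied by the confidence rating) and the difference values of Table 10.5 can be illustrated with the following sketch. The data structures and the tiny example are invented, and the repeated-measures ANOVA itself is omitted.

```python
from statistics import mean

def response_score(judgement, confidence):
    """Map a judgement ('affirmative'/'negative') and a 1-5 confidence rating
    onto the -5 .. +5 scale used in the analysis."""
    sign = 1 if judgement == "affirmative" else -1
    return sign * confidence

def cue_strength(stimuli, scores, parameter):
    """Mean score for stimuli with the affirmative setting of `parameter`
    minus the mean score for stimuli with its negative setting
    (cf. the difference values in Table 10.5).

    stimuli: list of dicts mapping parameter name -> 'affirmative' or 'negative'
    scores:  list of mean response scores, one per stimulus
    """
    aff = [s for stim, s in zip(stimuli, scores) if stim[parameter] == "affirmative"]
    neg = [s for stim, s in zip(stimuli, scores) if stim[parameter] == "negative"]
    return mean(aff) - mean(neg)

# Tiny fabricated example with two parameters and four stimuli.
stimuli = [
    {"smile": "affirmative", "delay": "affirmative"},
    {"smile": "affirmative", "delay": "negative"},
    {"smile": "negative",    "delay": "affirmative"},
    {"smile": "negative",    "delay": "negative"},
]
scores = [mean(response_score(j, c) for j, c in subject_responses)
          for subject_responses in (
              [("affirmative", 4), ("affirmative", 5)],
              [("affirmative", 3), ("negative", 2)],
              [("negative", 2), ("affirmative", 1)],
              [("negative", 4), ("negative", 5)],
          )]
print(cue_strength(stimuli, scores, "smile"))  # positive if 'smile' pushes responses towards affirmative
```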
5. Experimental Applications
During the past decade a number of experimental applications using the talking head have been developed at KTH. Four examples will be discussed here: the Waxholm demonstrator, designed to provide tourist information on the Stockholm archipelago; the Synface project, a visual hearing aid; the August project, a dialogue system in public use; and the AdApt multimodal real-estate agent.
5.1 The Waxholm Demonstrator
The first KTH demonstrator application, which we named Waxholm, gives information on boat traffic in the Stockholm archipelago. It references timetables for a fleet of some twenty boats from the Waxholm company connecting about two hundred ports [Bertenstam et al., 1995]. Besides the dialogue management and the speech recognition and synthesis components, the system contains modules that handle graphic information such as pictures, maps, charts, and timetables. This information can be presented as a result of the user-initiated dialogue.

The Waxholm system can be viewed as a micro-world, consisting of harbours with different facilities and boats that can be taken between them. The user gets graphic feedback in the form of tables complemented by speech synthesis. In the initial experiments, users were given a scenario with different numbers of subtasks to solve. A problem with this approach is that users tend to use the same vocabulary as the text in the given scenario. We also observed that the user often did not get enough feedback to be able to decide whether the system’s interpretation of the dialogue matched their own. To deal with these problems, a graphical representation that visualises the Waxholm micro-world was implemented. An example is shown in Figure 10.8. One purpose of this was to give the subject an idea of what can be done with the system, without expressing it in words. The interface continuously feeds back the information that the system has obtained from parsing the subject’s utterance, such as the time, the departure port and so on. The interface is also meant to give a graphical view of the knowledge the subject has secured thus far, in the form of listings of hotels and so on.
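One way to realise this kind of continuous feedback is to maintain a simple frame of dialogue slots that is updated after every parse and rendered back to the user. The sketch below is a hypothetical illustration and does not reflect the actual Waxholm implementation; the slot names are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryFrame:
    """Slots the interface can feed back to the user (illustrative subset)."""
    departure_port: Optional[str] = None
    arrival_port: Optional[str] = None
    departure_time: Optional[str] = None

    def update(self, parse_result: dict) -> None:
        # Copy only the slots the parser actually filled.
        for slot, value in parse_result.items():
            if hasattr(self, slot) and value is not None:
                setattr(self, slot, value)

    def render(self) -> str:
        # Textual stand-in for the graphical feedback panel.
        return " | ".join(f"{slot}: {value or '?'}"
                          for slot, value in vars(self).items())

frame = QueryFrame()
frame.update({"departure_port": "Stockholm", "departure_time": "10:15"})
print(frame.render())  # departure_port: Stockholm | arrival_port: ? | departure_time: 10:15
```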
Figure 10.8. The graphical model of the Waxholm micro-world.
The visual animated talking agent is an integral part of the system and is expected to raise the intelligibility of the system’s responses and questions. Furthermore, the addition of the face to the dialogue system has many other exciting implications. Facial non-verbal signals can be used to support turn-taking in the dialogue and to direct the user’s attention in certain ways, e.g. by letting the head turn towards timetables, charts, etc. that appear on the screen during the dialogue. The dialogue system also provides an ideal framework for experiments with non-verbal communication and facial actions at the prosodic level, as discussed above, since the system has a much better knowledge of the discourse context than is the case in plain text-to-speech synthesis.

To make the face appear more alive, one does not necessarily have to synthesise meaningful non-verbal facial actions. By introducing semi-random eyeblinks and very faint eye and head movements, the face looks much more active and becomes more pleasant to watch. This is especially important when the face is not talking.
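The semi-random idle behaviour mentioned above can be approximated by sampling blink times from a plausible inter-blink distribution and adding very small head and gaze perturbations. The sketch below is illustrative only; the timing parameters are assumptions, not the values used in the KTH systems.

```python
import random

def idle_behaviour(duration_s=10.0, mean_blink_interval_s=4.0, seed=1):
    """Return a list of (time, action) events for eyeblinks and faint head/gaze movements."""
    rng = random.Random(seed)
    events, t = [], 0.0
    next_blink = rng.expovariate(1.0 / mean_blink_interval_s)
    while t < duration_s:
        if t >= next_blink:
            events.append((round(t, 2), "blink"))
            next_blink = t + rng.expovariate(1.0 / mean_blink_interval_s)
        # A faint, slowly drifting head rotation and gaze offset (degrees).
        events.append((round(t, 2),
                       f"head_offset={rng.uniform(-1.0, 1.0):.2f}deg "
                       f"gaze_offset={rng.uniform(-0.5, 0.5):.2f}deg"))
        t += 0.5  # update the idle pose every 500 ms
    return events

for event in idle_behaviour(duration_s=3.0):
    print(event)
```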
5.2 The Synface/Teleface Project
The speech intelligibility of talking animated agents, such as the ones described above, has been tested within the Teleface project at KTH [Beskow et al., 1997; Agelfors et al., 1998]. The project has recently been continued and expanded in a European project, Synface [Granström et al., 2002b].
Figure 10.9. Telephone interface for Synface.
The project focuses on the use of multi-modal speech technology for hearing-impaired persons. The aim of the first phase of the project was to evaluate the increase in intelligibility that hearing-impaired persons experience when an auditory signal is complemented by a synthesised face. For this purpose, techniques for combining natural speech with lip-synchronised face synthesis have been developed. A demonstrator of a system for telephony with a synthetic face that articulates in synchrony with a natural voice is currently being implemented (see Figure 10.9).
5.3 The August System
The Swedish author August Strindberg provided the inspiration for the animated talking agent used in a dialogue system that was on display during 1998 as part of the activities celebrating Stockholm as the Cultural Capital of Europe [Gustafson et al., 1999]. This dialogue system made it possible to combine several domains, thanks to the modular functionality of the architecture. Each domain has its own dialogue manager, and an example-based topic spotter is used to relay user utterances to the appropriate dialogue manager. The animated agent “August” handles several tasks, such as taking visitors on a tour of the Department of Speech, Music and Hearing and giving street directions; while waiting for someone to talk to, he presents short excerpts from the works of August Strindberg.
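An example-based topic spotter of the kind mentioned here can be sketched as a nearest-example classifier over simple word overlap. The domains, example utterances and similarity measure below are invented for illustration and are not the actual August implementation.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two utterances."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

# Each domain dialogue manager is keyed by a handful of example utterances.
DOMAIN_EXAMPLES = {
    "lab_tour":   ["show me the department", "what happens in this lab"],
    "directions": ["how do I get to the royal palace", "where is the nearest metro station"],
    "strindberg": ["tell me about august strindberg", "read something you wrote"],
}

def spot_topic(utterance: str) -> str:
    """Route the utterance to the domain whose examples it overlaps most."""
    best_domain, best_score = None, -1.0
    for domain, examples in DOMAIN_EXAMPLES.items():
        score = max(word_overlap(utterance, ex) for ex in examples)
        if score > best_score:
            best_domain, best_score = domain, score
    return best_domain

print(spot_topic("where is the metro station"))  # directions
```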
Figure 10.10. The agent Urban in the AdApt apartment domain.
August was placed, unattended, in a public area of Kulturhuset in the centre of Stockholm. One challenge was this very open situation, in which no explicit instructions were given to the visitor. A simple visual “visitor detector” makes August start talking about one of his knowledge domains.
5.4 The AdApt Multimodal Real-Estate Agent
The practical goal of the AdApt project is to build a system in which a user can collaborate with an animated agent to solve complicated tasks [Gustafson et al., 2000]. We have chosen a domain in which multimodal interaction is highly useful, and which is known to engage a wide variety of people in our surroundings, namely, finding available apartments in Stockholm. In the AdApt project, the agent has been given the role of asking questions and providing guidance by retrieving detailed authentic information about apartments. The user interface can be seen in Figure 10.10. Because of the conversational nature of the AdApt domain, the demand is great for appropriate interactive signals (both verbal and visual) for encouragement, affirmation, confirmation and turn-taking [Cassell et al., 2000; Pelachaud et al., 1996]. As generation of prosodically grammatical utterances (e.g. correct focus assignment with regard to the information structure and dialogue state) is also one of the goals of the system it is important to maintain
modality consistency by the simultaneous use of both visual and verbal prosodic and conversational cues [Nass and Gong, 1999]. As described in Section 1, we are at present developing an XML-based representation of such cues that facilitates the description of both verbal and visual cues at the level of speech generation. These cues can be of varying range, covering attitudinal settings appropriate for an entire sentence or conversational turn, or be of a shorter nature, such as a qualifying comment on something just said. Cues relating to turn-taking or feedback need not be associated with speech acts but can occur during breaks in the conversation. Also in this case, it is important that there is a one-to-many relation between the symbols and the actual gesture implementations, to avoid stereotypic agent behaviour. Currently, a weighted random selection between different realizations is used.
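The one-to-many mapping from a symbolic cue to several gesture realizations, combined with weighted random selection, might look roughly like the sketch below. The cue names, realizations and weights are invented; the actual AdApt representation is XML-based and richer than this.

```python
import random

# One symbolic cue maps to several possible realizations, each with a weight.
GESTURE_LIBRARY = {
    "affirmative_feedback": [("short_nod", 0.5), ("smile", 0.3), ("nod_plus_smile", 0.2)],
    "turn_yielding":        [("gaze_at_user", 0.7), ("eyebrow_raise", 0.3)],
}

def realise_cue(cue, rng=random):
    """Pick one realization for a symbolic cue by weighted random selection,
    so that the agent does not always produce the same stereotyped gesture."""
    realizations, weights = zip(*GESTURE_LIBRARY[cue])
    return rng.choices(realizations, weights=weights, k=1)[0]

print(realise_cue("affirmative_feedback"))
```

In a fuller system one would presumably also condition the selection on dialogue state, for instance to avoid using the same realization twice in a row.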
6. The Effective Agent as a Language Tutor
The effectiveness of language teaching is often contingent upon the ability of the teacher to create and maintain the interest and enthusiasm of the student. The success of second language learning also depends on the student having ample opportunity to work on oral proficiency training with a tutor. The implementation of animated agents as tutors in a multimodal spoken dialogue system for language training holds much promise towards fulfilling these goals. Different agents can be given different personalities and different roles, which should increase the interest of the students. Many students may also be less bashful about making pronunciation errors in front of an agent that corrects them than in front of a human teacher.

Instructions to improve pronunciation often require reference to phonetics and articulation in a way that is intuitively easy for the student to understand. An agent can demonstrate articulations by providing sagittal sections which reveal articulator movements normally hidden from the outside. This type of visual feedback is intended both to improve the learner’s perception of new language sounds and to help the learner produce the corresponding articulatory gestures by internalising the relationships between the speech sounds and the gestures [Badin et al., 1998]. The articulator movements of such an agent can also be synchronised with natural speech at normal and slow speech rates. Furthermore, pronunciation training in the context of a dialogue automatically includes training of both individual phonemes and sentence prosody. As can be seen in Figure 10.11, the agent also provides different display possibilities for showing internal articulations.

In learning a foreign language, visual signals may in many contexts be more important than verbal signals. During the process of acquiring a language, both child L1 speakers and adult L2 speakers rely on gestures to supplement their own speech production [McNeill, 1992; Gullberg, 1998].
Figure 10.11. Different display possibilities for the talking head model. Different parts of the model can be displayed as polygons or smooth (semi) transparent surfaces to emphasise different parts of the model.
Adult L2 speakers often make more extensive use of gestures than L1 speakers, especially when searching for words or phrases in the new language. In this context, gestures have a compensatory function in production, often substituting for an unknown word or phrase. L2 listeners may also make greater use of visual cues to aid the conversational flow than L1 listeners do. In this respect, parallels can be drawn between the situation of the hearing-impaired listener and that of the L2 learner [McAllister, 1998]. It has been found [Kuhl et al., 1994; Burnham and Lau, 1999] that the integration of segmental audio-visual information is affected by the relationship between the language of the speaker and that of the listener. Subjects listening to a foreign language often incorporate visual information to a greater extent than do subjects listening to their own language. Furthermore, in a conversation, the L2 learner must not only concentrate on segmental phonological features of the target language while remembering newly learned lexical items, but must also respond to questions at the same time. This task creates a cognitive load for the L2 listener which differs in many respects from that for the L1 user of a spoken dialogue system. Thus, the compensatory possibilities of modality transforms and enhancements of the visual modality are well worth exploring, not only for segmental, phoneme-level information but also for prosodic and conversational information.

In the experiment mentioned earlier in Section 3, investigating the contribution of eyebrow movement to the perception of prominence in Swedish [Granström et al., 1999], words and syllables with concomitant eyebrow movement were generally perceived as more prominent than syllables without the movement. This tendency was even greater for a subgroup of
L2 listeners. For the acoustically neutral test sentence, the mean increase in prominence response following an eyebrow movement was 24 percent for the Swedish L1 listeners and 39 percent for the L2 group.

This type of multimodal perception of prominence can also be important for training lexical stress. In a study involving thirteen students of technical English (mostly native speakers of Swedish), the students first read a text containing words known to cause stress placement problems for Swedish speakers of English [Hincks, 2002]. The words were also selected because they often appear in technical contexts and are cognates which differ between English and Swedish primarily in the location of lexical stress. After making the recordings, the students used the WaveSurfer graphical interface first to synthesize the Swedish words and then to alter the stress location from Swedish to English by manipulating duration and fundamental frequency. The students were also told to place a head nod on the correct stressed syllable. Post-test recordings made four weeks after the exercise showed a general improvement in correctly stressed syllables in the students’ production of the test words, from about 35% correct (pre-test) to 70% correct (post-test).

While these experiments have addressed the issue of multimodal prominence signalling, conversational signals with their communicative functions are also of importance in the language learning context, not only to facilitate the flow of the conversation but also to facilitate the actual learning experience. It is therefore crucial that visual and verbal signals for encouragement, affirmation, confirmation and turn-taking function credibly in a multimodal system for language learning.
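A drastically simplified sketch of the kind of manipulation used in the lexical-stress exercise: per-syllable duration and F0 are rescaled to move the stress, and a head nod is attached to the newly stressed syllable. The data layout and scaling factors are assumptions, not the parameters used in the study.

```python
from copy import deepcopy

def shift_stress(syllables, new_stress_index, dur_scale=1.3, f0_scale=1.2):
    """syllables: list of dicts with 'text', 'duration_ms', 'f0_hz', 'stressed'.
    Returns a copy with stress moved to `new_stress_index`, marking a head nod
    on the stressed syllable (cf. the exercise described above)."""
    result = deepcopy(syllables)
    for i, syl in enumerate(result):
        stressed = (i == new_stress_index)
        if stressed and not syl["stressed"]:
            syl["duration_ms"] = int(syl["duration_ms"] * dur_scale)
            syl["f0_hz"] = int(syl["f0_hz"] * f0_scale)
        elif syl["stressed"] and not stressed:
            syl["duration_ms"] = int(syl["duration_ms"] / dur_scale)
            syl["f0_hz"] = int(syl["f0_hz"] / f0_scale)
        syl["stressed"] = stressed
        syl["head_nod"] = stressed
    return result

# Stress shifted from the first to the second syllable of an invented word.
word = [{"text": "fo",  "duration_ms": 220, "f0_hz": 130, "stressed": True},
        {"text": "na",  "duration_ms": 180, "f0_hz": 110, "stressed": False},
        {"text": "tek", "duration_ms": 160, "f0_hz": 100, "stressed": False}]
print(shift_stress(word, 1))
```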
7. Experiments and 3D Recordings for the Expressive Agent
A growing awareness of the importance of signalling emotions and attitudes in an "expressive" agent has led to an increase in research activity and new projects in related areas. One such European project, "PF-STAR" (Preparing future multisensorial interaction research), has as one of its goals the definition and assessment of a technological baseline for believable virtual agents in the form of talking heads, which produce speech and communicate emotions using audiovisual speech synthesis. An example of the August agent showing different (static) emotions is given in Figure 10.12.

The use of interactive talking agents in easy-to-use, intuitive man-machine interaction, and in human-human communication supported by synthetic faces, calls not only for basic emotional expressions, but also for attitudinal expressions and conversational signals for, e.g., feedback and turn-taking. To be able to create a realistic animated agent capable of emotional and attitudinal expressions, it is important to gather data from human-to-human communication.
Figure 10.12. August showing different emotions (from top left to bottom right): Happiness, Anger, Fear, Surprise, Disgust and Sadness.
Different methods of data collection are being investigated, including video recordings of dialogue sessions combined with 3D recordings of reflective points placed in strategic positions on the interlocutor's face and upper body (Figure 10.13). Automatic methods are being established for transferring this 3D data to the face model. Based on available and collected data, the generation models need to be augmented to handle the complex interaction and integration of the linguistic and extralinguistic signals. This calls both for improvements to the actual production models, to create a higher degree of realism, and for new strategies for controlling them while, e.g., combining articulation with different emotions.

From a human-agent interaction perspective, we have also established an experimental setting in which human users' facial expressions are recorded under varying conditions of agent expression, ranging from no facial expression, to helpful and happy (smiling and positive visual cues), to worried and irritated (frowning and negative visual cues). By keeping the dialogue scenario constant, we are able to gather a range of user speech and facial expressions under the different agent conditions. These types of studies will enable us to evaluate the agent expressions in terms of dialogue effectiveness. First, this will contribute to the building and evaluation of the gesture library in which gestures to be used in spoken dialogue systems for controlling the facial expression of emotions, attitudes and communicative signals are stored, weighted, and selected. Secondly, we will be
Figure 10.13. Subject with reflective points used for 3D recordings of facial movements.
able to benefit from the database of human facial expressions collected under a very controlled setting.
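A minimal sketch of one way such a transfer could work: per-frame marker displacements relative to a neutral pose are converted into normalised control parameters for the face model. The marker names, the choice of vertical displacement and the scaling are assumptions, not the method actually used at KTH.

```python
def markers_to_parameters(frame, neutral, scale):
    """frame, neutral: dicts marker_name -> (x, y, z) in mm.
    scale: dict marker_name -> (parameter_name, mm_for_full_activation).
    Returns face-model parameters clamped to the range -1 .. 1."""
    params = {}
    for marker, (pname, full_mm) in scale.items():
        # Vertical displacement relative to the neutral pose, normalised.
        dy = frame[marker][1] - neutral[marker][1]
        params[pname] = max(-1.0, min(1.0, dy / full_mm))
    return params

neutral = {"left_brow": (30.0, 55.0, 80.0), "mouth_corner_l": (25.0, 20.0, 85.0)}
frame   = {"left_brow": (30.0, 59.0, 80.0), "mouth_corner_l": (25.0, 22.5, 85.0)}
scale   = {"left_brow": ("brow_raise", 8.0), "mouth_corner_l": ("smile", 5.0)}
print(markers_to_parameters(frame, neutral, scale))
# {'brow_raise': 0.5, 'smile': 0.5}
```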
Acknowledgements

The work reported here was carried out by a great number of researchers at the Centre for Speech Technology, a competence centre at KTH, supported by VINNOVA (The Swedish Agency for Innovation Systems), KTH and participating Swedish companies and organizations. Marc Swerts collaborated on the feedback study while he was a guest at CTT. We also appreciated the constructive review by Dominic Massaro.
References

Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K.-E., and Öhman, T. (1998). Synthetic Faces as a Lipreading Support. In Proceedings of International Conference on Spoken Language Processing (ICSLP), pages 3047–3050, Sydney, Australia.
Badin, P., Bailly, G., and Boë, L-J. (1998). Towards the Use of a Virtual Talking Head and of Speech Mapping Tools for Pronunciation Training. In Proceedings of ESCA Workshop on Speech Technology in Language Learning (STiLL), pages 167–170. Stockholm: KTH. Bell, L. and Gustafson, J. (1999). Utterance Types in the August System. In Proceedings of the ESCA Tutorial and Research Workshop on Interactive Dialogue in Multi-Modal Systems (IDS), pages 81–84. Bertenstam, J., Beskow, J., Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S., Högberg, J., Lindell, R., Neovius, L., de Serpa Leitao, A., Nord, L., and Ström, N. (1995). The Waxholm System - A Progress Report. In Proceedings of ESCA Workshop on Spoken Dialogue Systems, pages 81–84, Vigsø, Denmark. Beskow, J. (1995). Rule-Based Visual Speech Synthesis. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech), pages 299–302, Madrid, Spain. Beskow, J. (1997). Animation of Talking Agents. In Proceedings of ESCA Workshop on Audio-Visual Speech Processing (AVSP), pages 149–152, Rhodes, Greece. Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K.-E., and Öhman, T. (1997). The Teleface Project - Multimodal Speech Communication for the Hearing Impaired. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech), pages 2003–2006, Rhodes, Greece. Beskow, J., Granström, B., and House, D. (2001). A Multimodal Speech Synthesis Tool Applied to Audio-Visual Prosody. In Keller, E., Bailly, G., Monaghan, A., Terken, J., and Huckvale, M., editors, Improvements in Speech Synthesis, pages 372–382. New York: John Wiley & Sons, Inc. Beskow, J., Granström, B., House, D., and Lundeberg, M. (2000). Experiments with Verbal and Visual Conversational Signals for an Automatic Language Tutor. In Proceedings of Integrating Speech Technology in the (Language) Learning and Assistive Interface (InSTIL), pages 138–142, Dundee, Scotland. Beskow, J., Granström, B., and Spens, K.-E. (2002). Articulation Strength - Readability Experiments with a Synthetic Talking Face. The Quarterly Progress and Status Report of the Department of Speech, Music and Hearing (TMH-QPSR), 44:97–100. http://www.speech.kth.se/qpsr/. Brennan, S. E. (1990). Seeking and Providing Evidence for Mutual Understanding. Stanford University, Stanford, CA. Unpublished doctoral dissertation. Burnham, D. and Lau, S. (1999). The Integration of Auditory and Visual Speech Information with Foreign Speakers: The Role of Expectancy. In Proceed-
ings of Auditory-Visual Speech Processing (AVSP), pages 80–85, Santa Cruz, USA. Carlson, R. and Granström, B. (1997). Speech Synthesis. In Hardcastle, W. and Laver, J., editors, The Handbook of Phonetic Sciences, pages 768–788. Oxford: Blackwell Publishers Ltd. Cassell, J., Bickmore, T., Campbell, L., Hannes, V., and Yan, H. (2000). Human Conversation as a System Framework: Designing Embodied Conversational Agents. In Cassell, J., Sullivan, J., Prevost, S., and Churchill, E., editors, Embodied Conversational Agents, pages 29–63. Cambridge, MA: MIT Press. Cavé, C., Guaïtella, I., Bertrand, R., Santi, S., Harlay, F., and Espesser, R. (1996). About the Relationship between Eyebrow Movements and F0 Variations. In Bunnell, H. T. and Idsardi, W., editors, Proceedings of International Conference on Spoken Language Processing (ICSLP), pages 2175–2178, Philadelphia, PA, USA. Clark, H. H. and Schaeffer, E. F. (1989). Contributing to Discourse. Cognitive Science, 13:259–294. Cole, R., Massaro, D. W., de Villiers, J., Rundle, B., Shobaki, K., Wouters, J., Cohen, M., Beskow, J., Stone, P., Connors, P., Tarachow, A., and Solcher, D. (1999). New Tools for Interactive Speech and Language Training: Using Animated Conversational Agents in the Classrooms of Profoundly Deaf Children. In Proceedings of ESCA/Socrates Workshop on Method and Tool Innovations for Speech Science Education (MATISSE), pages 45–52, London: University College London. Ekman, P. (1979). About Brows: Emotional and Conversational Signals. In von Cranach, M., Foppa, K., Lepinies, W., and Ploog, D., editors, Human Ethology: Claims and Limits of a New Discipline: Contributions to the Colloquium, pages 169–248. Cambridge: Cambridge University Press. Engwall, O. (2001). Making the Tongue Model Talk: Merging MRI and EMA Measurements. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech), pages 261–264, Aalborg, Denmark. Engwall, O. and Beskow, J. (2003). Resynthesis of 3D Tongue Movements from Facial Data. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech), pages 2261–2264, Geneva, Switzerland. Gill, S. P., Kawamori, M., Katagiri, Y., and Shimojima, A. (1999). Pragmatics of Body Moves. In Proceedings of Third International Cognitive Technology Conference, pages 345–358, San Francisco, USA. Granström, B., House, D., Beskow, J., and Lundeberg, M. (2001). Verbal and Visual Prosody in Multimodal Speech Perception. In Nordic Prosody VII, pages 77–87. Frankfurt: Peter Lang.
Granström, B., House, D., and Lundeberg, M. (1999). Prosodic Cues in Multimodal Speech Perception. In Proceedings of the International Congress of Phonetic Sciences (ICPhS), pages 655–658, San Francisco, USA. Granström, B., House, D., and Swerts, M. G. (2002a). Multimodal Feedback Cues in Human-Machine Interactions. In Bel, B. and Marlien, I., editors, Proceedings of the Speech Prosody 2002 Conference, pages 347–350. Aixen-Provence: Laboratoire Parole et Langage. Granström, B., Karlsson, I., and Spens, K.-E. (2002b). SYNFACE – A Project Presentation. The Quarterly Progress and Status Report of the Department of Speech, Music and Hearing (TMH-QPSR), 44:93–96. http://www.speech.kth.se/qpsr/. Gullberg, M. (1998). Gesture as a Communication Strategy in Second Language Discourse. A study of Learners of French and Swedish. Lund: Lund University Press. Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D., and Wirén, M. (2000). AdApt - A Multimodal Conversational Dialogue System in an Apartment Domain. In Proceedings of International Conference on Spoken Language Processing (ICSLP), volume 2, pages 134– 137, Beijing, China. Gustafson, J., Lindberg, N., and Lundeberg, M. (1999). The August Spoken Dialogue System. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech), pages 1151–1154, Budapest, Hungary. Hincks, R. (2002). Speech Synthesis for Teaching Lexical Stress. In Proceedings of Fonetik 2002. The Quarterly Progress and Status Report of the Department of Speech, Music and Hearing (TMH-QPSR), volume 44, pages 153–156. Stockholm: KTH. http://www.speech.kth.se/qpsr/. Hirschberg, J., Litman, D., and Swerts, M. (2001). Identifying User Corrections Automatically in Spoken Dialogue Systems. In Proceedings of The Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 208–215, Pittsburg, PA, USA. House, D., Beskow, J., and Granström, B. (2001). Timing and Interaction of Visual Cues for Prominence in Audiovisual Speech Perception. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech), pages 387–390, Aalborg, Denmark. Krahmer, E., Ruttkay, Z., Swerts, M., and Wesselink, W. (2002a). Pitch, Eyebrows and the Perception of Focus. In Bel, B. and Marlien, I., editors, Proceedings of the Speech Prosody 2002 Conference, pages 443–446, Aix-enProvence. Laboratoire Parole et Langage. Krahmer, E., Swerts, M., Theune, M., and Weegels, M. (2002b). The Dual of Denial: Two Uses of Disconfirmations in Dialogue and their Prosodic Correlates. Speech Communication, 36(1-2):133–145.
Kuhl, P. K., Tsuzaki, M., Tohkura, Y., and Meltzoff, A. M. (1994). Human Processing of Auditory-Visual Information in Speech Perception: Potential for Multimodal Human-Machine Interfaces. In Proceedings of International Conference on Spoken Language Processing (ICSLP), pages 539– 542, Yokohama, Japan. Massaro, D. W. (1998). Perceiving Talking Faces: From Speech Perception to a Behavioural Principle. Cambridge, MA: MIT Press. Massaro, D. W. (2002). Multimodal Speech Perception: A Paradigm for Speech Science. In Granström, B., House, D., and Karlsson, I., editors, Multimodality in Language and Speech Systems, pages 45–71. The Netherlands: Kluwer Academic Publishers. Massaro, D. W., Bosseler, A., and Light, J. (2003). Development and Evaluation of a Computer-Animated Tutor for Language and Vocabulary Learning. In 15th International Congress of Phonetic Sciences (ICPhS), pages 143–146, Barcelona, Spain. Massaro, D. W., Cohen, M. M., and Smeele, P. M. T. (1996). Perception of Asynchronous and Conflicting Visual and Auditory Speech. Journal of the Acoustical Society of America, 100:1777–1786. Massaro, D. W. and Light, J. (2003). Read My Tongue Movements: Bimodal Learning To Perceive and Produce Non-Native Speech /r/ and /l/. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech), pages 2249–2252, Geneva, Switzerland. McAllister, R. (1998). Second Language Perception and the Concept of Foreign Accent. In Proceedings of ESCA Workshop on Speech Technology in Language Learning (STiLL), pages 155–158, KTH, Stockholm. McNeill, D. (1992). Hand and mind: What Gestures Reveal about Thought. Chicago: University of Chicago Press. Nass, C. and Gong, L. (1999). Maximized Modality or Constrained Consistency? In Proceedings of Auditory-Visual Speech Processing (AVSP), pages 1–5, Santa Cruz, USA. Parke, F. I. (1982). Parameterized Models for Facial Animation. IEEE Computer Graphics, 2(9):61–68. Pelachaud, C., Badler, N. I., and Steedman, M. (1996). Generating Facial Expressions for Speech. Cognitive Science, 28:1–46. Shimojima, A., Katagiri, Y., Koiso, H., and Swerts, M. (2002). Informational and Dialogue-Coordinating Functions of Prosodic Features of Japanese Echoic Responses. Speech Communication, 36(1-2):113–132. Sjölander, K. and Beskow, J. (2000). WaveSurfer - an Open Source Speech Tool. In Proceedings of International Conference on Spoken Language Processing (ICSLP), volume 4, pages 464–467, Beijing, China. Traum, D. R. (1994). A Computational Theory of Grounding in Natural Language Conversation. PhD thesis, Rochester.
Chapter 11

CONTROLLING THE GAZE OF CONVERSATIONAL AGENTS

Dirk Heylen, Ivo van Es, Anton Nijholt and Betsy van Dijk
University of Twente, the Netherlands
{heylen, es, anijholt, bvdijk}@cs.utwente.nl

Abstract
We report on a pilot experiment that investigated the effects of different eye gaze behaviours of a cartoon-like talking face on the quality of human-agent dialogues. We compared a version of the talking face that roughly implements some patterns of human-like gaze behaviour with two other versions: in one, the shifts in gaze were kept minimal, and in the other, the shifts occurred randomly. The talking face has a number of restrictions. There is no speech recognition, so questions and replies have to be typed in by the users of the system. Despite this restriction, we found that participants who conversed with the agent that behaved according to the human-like patterns appreciated the agent more than participants who conversed with the other agents. Conversations with this version also proceeded more efficiently: participants needed less time to complete their task.
Keywords: Gaze, embodied conversational agents, human-computer interaction.
1. Introduction
Research on embodied conversational agents is carried out in order to improve models and implementations simulating aspects of human-like conversational behaviour as best as possible. Depending on the application or precise research aims, one might strive for the synthetic characters that one is building to be believable, trustworthy, likeable, human- and life-like. This involves, amongst other things, having the character display the appropriate signs of a changing mood, a recognisable personality and a rich emotional life. The actions that have to be carried out by agents in dialogue situations include the obvious language understanding and generation tasks: knowing how to carry out a conversation and all the types of conversational acts this
involves (openings, greetings, closings, repairs, asking a question, acknowledging, back-channelling, etc.) and also using all the different modalities, including body language (posture, gesture, and facial expressions). Although embodied conversational agents are still far from perfect, some agents have already been developed that can perform quite a few of the functions listed above to a reasonable extent and that can be useful in practical applications (tutoring, for instance). The Cassell et al. [2000] collection provides a good overview of such systems.

In our research laboratory we started to develop spoken dialogue systems some years ago. We built an interface to a database containing information on performances in the local theatres. Through natural language dialogue, people could obtain information about performances and order tickets. A second step involved reconstructing one of the theatres in 3D using VRML and designing a virtual human, Karin, that embodies this dialogue system (Figure 11.1).
Figure 11.1. The Virtual Music Center.
We first focused on several aspects of the multi-modal presentation of information [Nijholt and Hulstijn, 2000]. We combined presentation of information through the dialogue system with traditional desktop presentation methods such as tables and pop-up menus, and we combined natural language interaction with keyboard and mouse input. We wanted our basic version to be web-accessible, which, for reasons of efficiency, forced us at that time to leave out the speech recognition interface from this version.

The dialogue agent, Karin, is placed behind an information desk. When visitors enter the virtual environment of the Virtual Music Center and approach Karin, she starts to speak and asks whether there is anything she can do for the visitor. The browser screen looks as in Figure 11.2, with one part displaying the 3D environment, another part showing the dialogue window to the right and, below these two, a table presenting information about the performances as a result of the user queries. Visitors have to type in their part of the dialogue. Karin’s answers appear in the dialogue window and are also pronounced by a text-to-speech system.
Figure 11.2. Karin in the virtual environment.
We have moved on to implement other types of embodied conversational agents, designed to carry out other tasks such as navigating the user through the virtual environment or acting as tutors. Besides the work on building these other types of agents, we have also tried to explore in more depth different cognitive and affective models of agents, including symbolic BDI models as well as neural network models. We have also worked on extending their communicative skills. Current work, as summarised in [Heylen et al., 2001], is concerned with several aspects of non-verbal behaviour, including facial expressions, posture and gesture.

This chapter deals with one kind of non-verbal behaviour: gaze. Gaze has been shown to serve a number of functions in human-human interaction [Kendon, 1990]. On a meta-conversational level, gaze helps to regulate the flow of conversation and plays an important role in ensuring smooth turn-taking behaviour. Speakers, for instance, have the tendency to gaze away from listeners at potential turn-taking positions when they want to keep on talking. Listeners show continued attention when gazing at the speaker. On an interpersonal level, the duration and type of gaze communicate the nature of the relationship between the interlocutors.

This made us curious about our own situation with the agent Karin. We wondered whether implementing some kind of human-like rules for gaze behaviour would have any effects, and we therefore set up an experiment. Although people are talking to an agent in a somewhat unnatural way – they are typing in
their input, for instance – previous research had shown that people are sensitive to the gaze of such agents. In the next section of this chapter we will discuss some aspects of the function of gaze in face-to-face conversations between humans and in mediated forms of communication. Next we describe the experiment we did with our embodied agent Karin and discuss the results.
2. Functions of Gaze

2.1 General
The function of gaze in human-human, face-to-face dialogues has been studied quite extensively (see [Argyle and Cook, 1976] for an overview). The way speakers and hearers seek or avoid mutual eye contact, the function of looking towards or away from the interlocutor, and the timing of this behaviour in relation to aspects of discourse and information structure have all been investigated in great detail, and certain typical patterns have been found to occur. These investigations have considered many parameters, such as age, gender, personality traits, aspects of interpersonal relationships like friendship or dominance, and the nature of the setting in which the conversation takes place.

In trying to build life-like and human-like software agents that act as talking heads with which humans can interact as if they were talking face-to-face with another human, one is also led to consider the way the agents look away from and towards the human interlocutor. This has been the concern of several researchers on embodied conversational agents. It is also related to work on forms of mediated human-human communication, as in teleconferencing systems that make use of avatars, for instance. Previous research was mostly concerned with trying to describe an accurate computational model of gaze behaviour. A few evaluations of the effects of gaze on the quality of interactions in mediated conversation (mostly using avatars instead of autonomous agents) have been carried out by Vertegaal [1999], Garau et al. [2001], Colburn et al. [2000] and Thórisson and Cassell [1996], amongst others. These studies have shown that improving the gaze behaviour of agents or avatars in human-agent or human-avatar communication has noticeable effects on the way communication proceeds.
2.2 Human to Human
The amount of eye contact in a human-human encounter varies widely. Some of the sources of this variation as well as some typical patterns that occur have been identified. Women, for instance, are found to engage in eye contact more than men. Cultural differences account for part of the variation as well.
Argyle and Cook [1976] provide an extensive overview of these investigations. We will summarise some of the major findings here. When people in a conversation like each other or are cooperating, there is more eye contact. When personal or cognitively demanding topics are discussed, eye contact is avoided. Stressing the fact that the following figures are only averages and that wide variation is found, Argyle [1993] provides the following statistics on the percentage of time people look at one another in dyadic (two-person) conversations. About 75% of the time that people are listening coincides with gazing at the speaker. People who are talking look at the listener less of the time (40%). During a complete interaction there is eye contact (mutual gaze) only 30% of the time.

Common subjective interpretations of eye contact include friendship, sexual attraction, hate and a struggle for dominance. This list shows that the same behaviour can result in opposite valuations; the precise quality depends on the circumstances. The simplest inference a person can draw from someone looking at him is that the other is paying attention. In a face-to-face conversation this is appreciated positively. However, in public places extensive gazing at strangers may be felt to be impolite or even threatening. There are also individual differences in the amount and type of gaze depending on personality traits. “Gaze levels are also higher in those who are extroverted, dominant or assertive, and socially skilled. Perception of eyes leads to arousal, and to avoidance after a certain period of exposure. The finding that extraverts (who have a low level of arousal) look in general more than introverts is consistent with this hypothesis” [Argyle and Cook, 1976, page 21]. People who look more tend to be perceived more favourably, other things being equal, and in particular as competent, friendly, credible, assertive and socially skilled [Kleinke, 1987].

Besides these more psychological or emotional signal functions of gaze, looking at the conversational partner also plays an important part in regulating the interaction. The patterns in turn-taking behaviour and their relation to (mutual) gaze have been the subject of several investigations. Studying the patterns in gaze and turn-taking behaviour, Kendon [1990] was one of the first to look in some detail at how gaze behaviour operates in dyadic conversations. He distinguishes between two important functions of an individual’s perceptual activity in social interaction: by looking or not looking, a person can control the degree to which he monitors his interlocutor, and this choice can also have regulatory and expressive functions. Argyle and Dean [1972] report that in all investigations where this has been studied, it has been found that there is more eye contact when the subject is listening than when he is speaking (cf. Table 11.1). Furthermore, people look up at the end of their turn and/or at the end of phrases and look away at the start
of (long) utterances, not necessarily resulting in mutual gaze or eye contact. The patterns in gaze behaviour are explained by a combination of principles. Speakers that start longer utterances tend to look away to concentrate on what they are saying, avoiding distraction, and to signal that they are taking the floor and do not want to be interrupted. At the end of a turn, speakers tend to look up to monitor the hearer’s reaction and to offer the floor.

Table 11.1. Percentages of Gaze in Dyadic Conversations.

Individual gaze      60 %
  While listening    75 %
  While talking      40 %
Eye-contact          30 %
In [Cassell et al., 1999], the relation between gaze, turn-taking, and information structure is investigated in more detail. The empirical analysis shows the general pattern of looking away from and towards the hearer at turn-switching positions. The main finding reported there is that if a turn begins with the thematic part (the part that links the utterance with previously uttered or contextualised information), then the speaker will always look away, and when the end of the turn coincides with a rhematic part (which provides new information), then the speaker will always look towards the listener at the beginning of that rhematic part. In general, beginnings of themes and beginnings of rhemes are important places where looking-away and looking-towards movements occur.

From these observations on gaze behaviour one can derive some predictions with respect to the effects of the design of gaze behaviour for an embodied conversational agent. The amount and type of gaze will influence how the character of the agent is perceived, and the patterns of gaze in relation to discourse and information structure may lead to more or less efficient conversations. However, these are predictions based on findings about human-human, face-to-face interaction. In the next subsection we look at some studies of gaze behaviour in mediated conversations between human interlocutors and in conversations between humans and agents.
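As a concrete reading of these findings, a rule-based gaze controller for an agent might look roughly like the following sketch. The rules follow Kendon [1990] and Cassell et al. [1999] as summarised above, and the fallback probabilities use the 40%/75% figures cited earlier; the interface itself is invented.

```python
import random

def choose_gaze(role, turn_position=None, info_structure=None, rng=random):
    """Return 'at_user' or 'away' for the current moment in the dialogue.

    role:           'speaking' or 'listening'
    turn_position:  'start', 'middle', 'end' of the agent's turn (if speaking)
    info_structure: 'theme' or 'rheme' for the part being uttered (if known)
    """
    if role == "speaking":
        if turn_position == "start" and info_structure == "theme":
            return "away"       # look away when beginning a turn with the theme
        if turn_position == "end" and info_structure == "rheme":
            return "at_user"    # look at the listener for a turn-final rheme
        # Fallback: speakers gaze at the listener about 40% of the time.
        return "at_user" if rng.random() < 0.40 else "away"
    # Listeners gaze at the speaker about 75% of the time.
    return "at_user" if rng.random() < 0.75 else "away"

print(choose_gaze("speaking", turn_position="end", info_structure="rheme"))  # at_user
```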
2.3 Mediated Conversations
Several researchers have investigated the effects of implementing gaze behaviour in conversational agents or in other forms of mediated conversation between humans. In videoconferencing for instance, avatars may be used to represent the users. Vertegaal [1999] describes the GAZE groupware system in which participants are represented by simple avatars. Eye-tracking of the participants informs the direction in which the avatars look at each other on the screen so that
the avatars mimic the gazing behaviour of the participants; see also [Vertegaal et al., 2001]. Experiments have shown that this improves such videoconferencing discussions in several ways. Garau et al. [2001] describe an experiment with dyadic conversation between humans in four mediated conditions: video, audio-only, random-gaze avatars and informed-gaze avatars. In the latter case, gaze was related to conversational flow. The experiment showed that the random-gaze avatar did not improve on audio-only communication, whereas the informed-gaze avatar significantly outperformed audio-only on a number of response measures. Colburn et al. [2000] also describe some experiments on conversations between humans and avatars in a video-conferencing context. One of the questions they asked was whether users that interact with an avatar act in ways that resemble human-human interaction or whether the knowledge that they are talking to an artificial agent counteracts natural reactions. In one experiment they changed the gaze behaviour of avatars during a conversation. It appeared from this and similar experiments that participants, though not consciously aware of the differences in the avatar's gaze behaviour, still react differently (subliminally). In the context of embodied conversational agents, rules for gaze behaviour of agents have been studied by Cassell et al. [1994; 1999]. Algorithms and architectures for controlling the non-verbal behaviour of agents, including gaze, are also presented in [Chopra-Khullar and Badler, 1999] and [Novick et al., 1996]. Poggi et al. [2000] provide an interesting basis for implementing eye communication of agents, including gaze, by relating it to the communicative parameters involved in a face-to-face interaction, more specifically the conversational actions of the agents and their beliefs, desires and intentions. Most of these authors have focused on obtaining appropriate computational models rather than on evaluation. Before we did our experiment, the work on evaluation of gaze behaviour in mediated communication had been concerned almost exclusively with human-human conversations in videoconferencing and not, to any great extent, with conversations between humans and autonomous embodied conversational agents. However, the evaluation work on human-controlled avatars and mediated conversation suggested that reasonable effects could also be expected in mediated conversations with agents. Hence we were motivated to investigate the effects of implementing different gaze behaviours for Karin, our embodied conversational agent, despite the fact that in this case the conversation is somewhat unnatural in that users have to type in their part of the conversation, and despite the fact that people may become distracted by the information presented in tables. Previous work on evaluation in this respect is reported in [Thórisson and Cassell, 1996]. They found that conversations with a gaze-informed agent
increased ease, believability and efficiency compared to a content-only agent and an agent that produced content and emotional emblems. Since we did our experiment, some more research on gaze has been published. Fukayama et al. [2002] have implemented a gaze movement model based on observations in the psychological literature. By systematically varying the parameters amount of gaze, mean duration of gaze and gaze points while averted, they have tried to influence the impression conveyed by the agent. In an initial experiment set up to confirm the validity of the gaze movement model, they found that subjects could indeed report the impressions conveyed by an eyes-only agent that moved its eyes according to their set of gaze parameters. In our pilot experiment, described in the next section, we were not so much interested in the precise rules or the architecture of the system implementing the rules, but rather in the effects on dialogue quality that a simple implementation of the patterns might have. Some of the factors that we looked into are the efficiency of interactions, the way people judge the character of the agents and how they rate the quality of the conversation in general.
3. The Experiment
3.1 Participants, Task and Procedure
We had 48 participants in our experiment. They were all graduate students of the University of Twente, aged between 18 and 25; two thirds were male and one third female. The participants were randomly assigned to one of the three conditions, taking care that the male/female ratio was roughly the same for each. The participants were given instructions on paper to make reservations for two concerts. During the execution of the task they were left alone in a room monitored by two cameras. After they finished the task they filled out a questionnaire. The questionnaire, together with the notes taken when observing the participants through the cameras and the time it took the participants to complete the task, was used to evaluate the differences between the three versions of the agent.
3.2 Versions
In the web-accessible version of Karin and the 3D world, visitors have to enter the virtual environment and walk to the reception desk to talk to Karin. In the experiment we started the application so that the participants were positioned face-to-face with Karin immediately. We also left the dialogue box, in which Karin’s replies are normally typed out, blank to reduce distraction.
Figure 11.3. Karin as presented in the experiment.
In a face-to-face conversation, people typically look at each other, at objects of mutual interest, and blankly into space or at irrelevant objects [Argyle and Cook, 1976]. Karin in our experiment is capable of the following behaviours:
1. Gaze at the visitor (Gaze)
2. Look away from the visitor (Avert)
3. Look at the table of performances (Direct)
The third behaviour refers to a table of performances that can appear below Karin (see Figure 11.2) when performances are retrieved in response to a user query. In this case Karin turns her eyes down to draw attention toward it. This occurs together with Karin saying "Take a look at the table for a list of the performances." or something similar.
Figure 11.4. Various ways of looking away.
When Karin turns her eyes away from the interlocutor, she will mostly turn her eyes upwards, as this is the most typical way of indicating a thinking mode [Poggi et al., 2000]. (People typically have a bias towards one direction when averting their eyes in thinking mode; Karin will mostly turn her eyes leftwards.) Several stylistic variations of looking away from the
visitor were implemented. They are accompanied by tilting and rotating the head, as is typical for the way humans look away (see Figure 11.4). In Table 11.2 part of a typical conversation is given, with indications of how Karin turns her eyes away from and towards the human participant, for both the optimal and the suboptimal version.

Table 11.2. Sample Dialogue.
K: Hello, I'm Karin. What can I help you with?
   Optimal: Avert, Gaze    Suboptimal: Gaze, Gaze
S: Hi. When is the next concert of X?
K: Just a moment, while I look it up. There are 27 concerts. Take a look at the table for the dates.
   Optimal: Avert, Direct, Gaze    Suboptimal: Gaze, Direct, Gaze
S: I want to book tickets for the concert on November 7.
K: You want to make a reservation for the Lunch series. I have the following information for this series: 20 guilders normal rate. How many tickets do you want?
   Optimal: Avert, Gaze, Avert, Gaze    Suboptimal: Gaze, Gaze, Gaze, Gaze
In the optimal version Karin averts her eyes at the beginning of a turn for a short period and then starts gazing again. In general Karin's replies are quite short, but some consist of somewhat longer sequences, for instance when she repeats the information she has so far and adds a question to initiate the next step in the reservation. This is illustrated by the last reply in Table 11.2. In that case, Karin averts her eyes from the speaker to indicate that she is not ready yet and does not want the user to take the turn. We have tried to time eye movements and information structure in accordance with the rules described by Cassell et al. [1999]:

    for each proposition in the list of propositions to be realized sequentially by a language generator
        if current proposition is thematic
            if beginning of turn or distribution(.70)
                attach a look-away from the hearer
            endif
        else if current proposition is rhematic
            if end of turn or distribution(.73)
                attach a look-toward the hearer
            endif
    endfor

In this pseudo-code, distribution(x) is a randomized function that returns true with probability x. We did not use this probability function in our algorithm; instead, the length of the utterances was used to determine whether a gaze or avert action could take place. Because the dialogue grammar is not very extensive, we were able to mark it up so that it produces the results one would expect from the algorithm.
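To make the rule concrete, the sketch below shows one possible rendering of the Cassell et al. style assignment in Python. It is only an illustration, not the code used for Karin (which replaced the distribution(x) test with an utterance-length criterion); the representation of propositions as (text, information-status) pairs is an assumption made for the example.

    import random

    def distribution(p):
        # Randomized test from the pseudo-code above: true with probability p.
        return random.random() < p

    def assign_gaze(propositions):
        # propositions: list of (text, info) pairs, with info "theme" or "rheme";
        # the list covers one turn, realized sequentially by the generator.
        annotated = []
        for i, (text, info) in enumerate(propositions):
            beginning_of_turn = (i == 0)
            end_of_turn = (i == len(propositions) - 1)
            gaze = None
            if info == "theme":
                if beginning_of_turn or distribution(0.70):
                    gaze = "Avert"   # look away from the hearer
            elif info == "rheme":
                if end_of_turn or distribution(0.73):
                    gaze = "Gaze"    # look toward the hearer
            annotated.append((text, gaze))
        return annotated

    # Example: the last reply from Table 11.2, with assumed theme/rheme labels.
    turn = [("You want to make a reservation for the Lunch series.", "theme"),
            ("I have the following information for this series: 20 guilders normal rate.", "rheme"),
            ("How many tickets do you want?", "rheme")]
    print(assign_gaze(turn))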
We introduced a second version in which Karin only stops looking at the user when she directs the user with her eyes to the table with the performances. We will refer to this as the suboptimal version. Eye movements are severely limited in this version, so no cues are given by the eyes with respect to turn structure. In the third version a random eye-movement action was chosen at each position at which a specific eye-movement change could occur in the optimal version. This means that there are more eye movements than in the suboptimal version, but the movements are not, in general, related to the conventions described above.
The pictures in Table 11.3 show some of the interaction between Karin (the optimal version) and one participant of the experiment. In the first and second shots we see Karin introducing herself. She gradually tilts her head to the left and turns her eyes away from the user. She immediately turns her head back and resumes eye contact when she starts her second utterance, the question "What can I help you with?". As soon as this starts, the participant puts his fingers on the keyboard, waiting for Karin to end her turn, ready to start typing in his question, as can be seen in the third screenshot. Next the user types in his question, asking whether he can make a reservation. During a brief period, while Karin is asking about details, the user reads the instructions for details about the task he was given. The dialogue manager that provides the agent's dialogue intelligence asks a series of questions to get all the information needed to make a reservation. The last shot shows how Karin looks at the table that lists the performances that match the user's query.

Table 11.3. Screenshots from an interaction.
3.3 Measures
In general, we wanted to find out whether participants talking to the optimal version of Karin were more satisfied with the conversation than the other participants. We distinguished between several factors that could be judged: ease of use, satisfaction, involvement, efficiency, personality/character, naturalness of eye and head movements and mental load. Most of the measures
were judgements on a five-point Likert scale. A selection of the questions asked is presented in Table 11.4.
Table 11.4. Sample Questions.
Satisfaction
  I liked talking to Karin
  It takes Karin too long to respond
  The conversation had a clear structure
  I like ordering tickets this way
Ease of Use
  It is easy to get the right information
  It was clear what I had to ask/say
  It took a lot of trouble to order tickets
Involvement
  I think I looked at Karin about as often as I look at interlocutors in normal conversations
  Karin keeps her distance
  It was always clear when Karin finished speaking
Personality
  I trust Karin
  Karin is a friendly person
  Karin is quite bad tempered
Some factors were evaluated by taking other measures into account. The time it took to complete the tasks was used, for instance, to measure efficiency. We also asked participants some questions about the things said in the dialogue to judge differences in the attention they had paid to the task. We were not sure whether participants would be influenced a lot by the differences in the gaze behaviour. However, if there were any effects, we assumed that the optimal version would be most efficient, in that it signals turn-taking mimicking human patterns.
3.4 Results
Efficiency, measured in terms of the time to complete the tasks, was analyzed using a one-way ANOVA test. The table shows the time in minutes. A significant difference was found between the three groups (F(2,45)=3.80, p<.05). For means and corresponding standard deviations see Table 11.5. To find out which version was most efficient, the groups were compared two by two using t-tests (instead of post-hoc analysis). The optimal version was found to be significantly more efficient than the suboptimal version (t(30)=-2.31, p<.05, 1-tailed) and the random version (t(30)=-2.64, p<.01). No significant difference (at 5% level) was found between the suboptimal and the random version. The main effect of the experimental conditions on the other factors was analyzed using the Kruskal-Wallis test. Answers to questions were recoded such that for all factors the best possible score was 1 and the worst score was
5. The results are summarized in Table 11.5. The table shows significant differences between the versions for ease of use, satisfaction and naturalness of head movement, and a marginally significant difference for personality.

Table 11.5. The main effects of experimental condition: means and standard deviations (in parentheses) of the factor scores and the results of the Kruskal-Wallis test.

Factors                   Opti          Sub           Ran           χ2
Ease of use               2.55 (1.31)   3.05 (1.30)   2.66 (1.17)   12.09∗
Satisfaction              2.33 (1.20)   2.74 (1.29)   2.79 (1.20)   9.63∗
Involvement               3.08 (1.35)   3.47 (1.28)   3.47 (1.17)   3.53
Personality               2.46 (1.21)   2.79 (1.27)   2.79 (1.14)   5.62†
Natural head movement     1.31 (.62)    1.31 (.55)    1.63 (.61)    11.66∗
Natural eye movement      1.13 (.39)    1.13 (.49)    1.29 (.58)    3.34
Attention                 2.54 (1.27)   3.02 (1.31)   2.63 (1.20)   3.93
Efficiency                6.88 (2.00)   8.88 (2.83)   9.56 (3.56)   -

† p<.10   ∗ p<.01
Two by two comparisons using Mann-Whitney tests pointed out that on the factor ease of use the optimal version was significantly better than the suboptimal version (U=6345, p<.001). Users of the optimal version were more satisfied than users of the suboptimal and the random version (resp. U=5140, p<.05 and U=4913.5, p<.01). On the factor personality the optimal version was better than the random version (U=5261.5, p<.05) and marginally better than the suboptimal version (U=5356.5, p<.10). Both the optimal and the suboptimal agent moved their head more naturally than the random agent (resp. U=805.5, p<.01 and U=823.5, p<.01). The eye movements were found to be marginally better in the optimal version than in the random version (U=1006, p<.10). On the factor attention the difference between the optimal version and the suboptimal version was marginally significant (U=910, p<.10). The other comparisons yielded no significant differences.
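As an aside, this style of analysis can be reproduced with standard statistics libraries. The sketch below uses SciPy; the arrays are hypothetical completion times standing in for the experimental data (16 participants per condition), not the actual measurements.

    import numpy as np
    from scipy import stats

    # Hypothetical task-completion times in minutes (16 participants per condition).
    optimal    = np.array([5.1, 6.3, 7.0, 6.5, 8.2, 7.4, 6.0, 7.8, 6.6, 7.2, 5.9, 6.8, 7.5, 6.1, 7.9, 8.0])
    suboptimal = np.array([7.9, 8.5, 9.2, 8.8, 10.1, 9.5, 7.6, 9.9, 8.4, 9.0, 8.1, 9.7, 10.4, 8.6, 9.3, 8.9])
    random_v   = np.array([8.8, 9.6, 10.3, 9.9, 11.0, 10.5, 8.5, 10.8, 9.4, 10.0, 9.1, 10.7, 11.4, 9.5, 10.2, 9.8])

    # One-way ANOVA over the three conditions (efficiency).
    f_val, p_anova = stats.f_oneway(optimal, suboptimal, random_v)

    # Pairwise t-test between the optimal and suboptimal versions.
    t_val, p_t = stats.ttest_ind(optimal, suboptimal)

    # Kruskal-Wallis for the ordinal questionnaire factors, with Mann-Whitney U as
    # pairwise follow-up (applied here to the same arrays purely for illustration).
    h_val, p_kw = stats.kruskal(optimal, suboptimal, random_v)
    u_val, p_u = stats.mannwhitneyu(optimal, suboptimal, alternative="two-sided")

    print(f"ANOVA: F={f_val:.2f}, p={p_anova:.3f}")
    print(f"Kruskal-Wallis: H={h_val:.2f}, p={p_kw:.3f}")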
4. Discussion
The table clearly shows that the optimal version performs best overall. We can thus conclude that even a crude implementation of gaze patterns in turn-
taking situations has significant effects. Not only do participants like the optimal version best (satisfaction) and find it easier to use and more natural-looking, they also perform the tasks much faster. The more natural version is preferred over a version in which the eyes are fixed almost constantly, and over a version in which the eyes move as much as in the optimal situation but do not follow the conventional patterns of gaze. To measure satisfaction, participants were asked to rate how well they liked Karin and how they felt the conversation went in general, besides some other questions that relate directly or indirectly to what can be called satisfaction. The participants of the optimal version were not only more satisfied with their version, they also related more to Karin than the participants of the other versions did, as they found her to be more friendly, helpful, trustworthy, and less distant.
The differences between the optimal and the suboptimal version seem to correspond to patterns observed in human-human interaction. In the suboptimal version, Karin looks at the visitor almost constantly. Although in general people who look more tend to be perceived more favourably, as mentioned above [Kleinke, 1987], in this case the suboptimal version, in which Karin looks at the participants the most of all the versions, is not the preferred one. This is, however, in line with a conclusion of Argyle et al. [1974], who point out that continuous gaze can result in negative evaluation of a conversation partner. This probably largely explains why Karin is perceived less favourably as a person in the suboptimal version (as compared to the optimal version). Note that Karin still looks at participants quite a lot in the optimal version: she only looks away at the beginning of turns and at potential turn-taking positions when she wants to keep the turn; otherwise she looks at the listener while speaking. She also looks towards the interlocutor while listening. She therefore seems to strike an adequate balance, gazing enough to be liked but not too much.
When participants evaluate how naturally the face behaves, it appears that the random version scored lower than the other versions, but no differences could be noted between the optimal and suboptimal version. Making "the right" head and eye movements or almost no movements are both perceived as equally natural, whereas random movements are judged less natural. What is interesting, however, is that these explicit judgements on the life-likeness of the agents' behaviour do not directly mirror the judgements on other factors. The random version may be rated as less natural than the others, but in general it does not perform worse than the suboptimal version. For the factor ease of use it is even judged significantly better than the suboptimal version. Does this mean that having regular movements of the eyes instead of almost fixed eyes is the important cue here? On the other hand, the difference in this rating (which is derived from judgements on a question like "was it easy to order tickets") is not in line with the real amount of time people actually
spent on the task. Though the random version is judged easy to use, it takes the participants using it the most time to complete the tasks. The optimal version is clearly the most efficient in actual use. This gain in efficiency might be a result of the transparency of turn-taking signals; that is, the flow of conversation may have improved, as one would expect when regulators like gaze work appropriately. But the gain might also, indirectly, have been a result of the increased involvement in the conversation of the participants that used the optimal version. Which is cause and which is effect is difficult to say. We have an indication that the different gaze patterns had some impact not just on overall efficiency but also on the awareness of participants of when Karin was finishing her turn. We have some rough figures on the number of times participants started their turn before Karin was finished with hers. In almost all of these cases this slowed down the task, because participants would have to change their utterance midway.

Table 11.6. Number of participants interrupting Karin.
                    Opt   Sub   Ran
Often/Regularly     -     5     4
Sometimes           4     2     3
Never               12    9     9
These figures are not conclusive, but give an indication that at least in the optimal version, participants did seem to take into account the gaze behaviour of Karin as part of the cues that regulate turn-taking behaviour.
5. Conclusion
In face-to-face conversations between human interlocutors, gaze is an important factor in signalling interpersonal attitudes and personality. Gaze and mutual gaze also function as indicators that help in guiding turn-switching. In the experiment that we have conducted, we were interested in the effects of implementing a simple strategy to control the eye movements of an artificial agent at turn-taking boundaries. The crude rules that we have used proved sufficient to establish significant improvements in communication between humans and embodied conversational agents. The effort to investigate and implement human-like behaviour in artificial agents therefore seems to be well worth the investment.
References

Argyle, M. (1993). Bodily Communication. Routledge, second edition.
Argyle, M. and Cook, M. (1976). Gaze and Mutual Gaze. Cambridge University Press.
Argyle, M. and Dean, J. (1972). Eye Contact, Distance and Affiliation. In Laver, J. and Hutcheson, S., editors, Communication in Face to Face Interaction, pages 155–171. Penguin.
Argyle, M., Lefebre, L., and Cook, M. (1974). The Meaning of Five Patterns of Gaze. European Journal of Social Psychology, 4(2):125–136.
Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, T., Douville, B., Prevost, S., and Stone, M. (1994). Animated Conversation: Rule-Based Generation of Facial Expression, Gesture and Spoken Intonation for Multiple Conversational Agents. Computer Graphics, pages 413–420.
Cassell, J., Sullivan, J., Prevost, S., and Churchill, E., editors (2000). Embodied Conversational Agents. MIT Press.
Cassell, J., Torres, O., and Prevost, S. (1999). Turn Taking vs. Discourse Structure. In Wilks, Y., editor, Machine Conversations, pages 143–154. Kluwer.
Chopra-Khullar, S. and Badler, N. I. (1999). Where to Look? Automating Attending Behaviors of Virtual Human Characters. In Proceedings of Autonomous Agents, pages 9–23, Seattle.
Colburn, R. A., Cohen, M. F., and Drucker, S. M. (2000). Avatar Mediated Conversational Interfaces. Technical Report MSR-TR-2000-81, Microsoft.
Fukayama, A., Ohno, T., Mukawa, N., Sawaki, M., and Hagita, N. (2002). Messages Embedded in Gaze of Interface Agents: Impression Management with Agent's Gaze. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 41–48, Minneapolis, Minnesota, USA. ACM.
Garau, M., Slater, M., Bee, S., and Sasse, M. A. (2001). The Impact of Eye Gaze on Communication Using Humanoid Avatars. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 309–316, Seattle, Washington, USA.
Heylen, D., Nijholt, A., and Poel, M. (2001). Embodied Agents in Virtual Environments: The Aveiro Project. In Proceedings of European Symposium on Intelligent Technologies, Hybrid Systems and their Implementation on Smart Adaptive Systems, pages 110–111, Tenerife, Spain. Verlag Mainz, Wissenschaftsverlag Aachen.
Kendon, A. (1990). Some Functions of Gaze Direction in Two-Person Conversation. In Conducting Interaction, pages 51–89. Cambridge University Press.
Kleinke, C. L. (1987). Gaze and Eye Contact: A Research Review. Psychological Bulletin, 100:78–100.
Nijholt, A. and Hulstijn, J. (2000). Multimodal Interactions with Agents in Virtual Worlds. In Kasabov, N., editor, Future Directions for Intelligent Information Systems and Information Science, pages 148–173. Physica-Verlag.
Novick, D. G., Hansen, B., and Ward, K. (1996). Coordinating Turn-Taking with Gaze. In Proceedings of International Conference on Spoken Language Processing (ICSLP), pages 1888–1891, Philadelphia, Pennsylvania, USA.
Poggi, I., Pelachaud, C., and de Rosis, F. (2000). Eye Communication in a Conversational 3D Synthetic Agent. AI Communications, Special Issue on Behavior Planning for Life-Like Characters and Avatars, 13(3):169–181.
Thórisson, K. R. and Cassell, J. (1996). Why Put an Agent in a Body: The Importance of Communicative Feedback in Human-Humanoid Dialogue. In Proceedings of Lifelike Computer Characters, pages 44–45, Utah, USA.
Vertegaal, R. (1999). The GAZE Groupware System: Mediating Joint Attention in Multiparty Communication and Collaboration. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 294–301, Pittsburgh, Pennsylvania, USA. ACM.
Vertegaal, R., Slagter, R., van der Veer, G., and Nijholt, A. (2001). Eye Gaze Patterns in Conversation: There is More to Conversational Agents than Meets the Eyes. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 301–308, Seattle, Washington, USA.
PART IV
ARCHITECTURES AND TECHNOLOGIES FOR ADVANCED AND ADAPTIVE MULTIMODAL DIALOGUE SYSTEMS
Chapter 12
MIND: A CONTEXT-BASED MULTIMODAL INTERPRETATION FRAMEWORK IN CONVERSATIONAL SYSTEMS
Joyce Y. Chai
Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan 48824, USA
[email protected]
Shimei Pan and Michelle X. Zhou
IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA
{shimei, mzhou}@us.ibm.com
Abstract
In a multimodal human-machine conversation, user inputs are often abbreviated or imprecise. Simply fusing multimodal inputs together may not be sufficient to derive a complete understanding of the inputs. Aiming to handle a wide variety of multimodal inputs, we are building a context-based multimodal interpretation framework called MIND (Multimodal Interpreter for Natural Dialog). MIND is unique in its use of a variety of contexts, such as domain context and conversation context, to enhance multimodal interpretation. In this chapter, we first describe a fine-grained semantic representation that captures salient information from user inputs and the overall conversation, and then present a context-based interpretation approach that enables MIND to reach a full understanding of user inputs, including those abbreviated or imprecise ones.
Keywords: Multimodal input interpretation, multimodal interaction, conversation systems.
1. Introduction
To aid users in their information-seeking process, we are building an infrastructure, called Responsive Information Architect (RIA), which can engage users in an intelligent multimodal conversation.

Figure 12.1. RIA infrastructure.

Currently, RIA is embodied in
a testbed, called Real HunterTM , a real-estate application for helping users find residential properties. Figure 12.1 shows RIA’s main components. A user can interact with RIA using multiple input channels, such as speech and gesture. To understand a user input, the multimodal interpreter exploits various contexts (e.g., conversation history) to produce an interpretation frame that captures the meanings of the input. Based on the interpretation frame, the conversation facilitator decides how RIA should act by generating a set of conversation acts (e.g., Describe information to the user). Upon receiving the conversation acts, the presentation broker sketches a presentation draft that expresses the outline of a multimedia presentation. Based on this draft, the language and visual designers work together to author a multimedia blueprint which contains the details of a fully coordinated multimedia presentation. The blueprint is then sent to the media producer to be realized. To support all components described above, an information server supplies various contextual information, including domain data (e.g., houses and cities for a real-estate application), a conversation history (e.g., detailed conversation exchanges between RIA and a user), a user model (e.g., user profiles), and an environment model (e.g., device capabilities). Our focus in this chapter is on the interpretation of multimodal user inputs. We are currently developing a context-based multimodal interpretation framework called MIND (Multimodal Interpreter for Natural Dialog). MIND is inspired by the earlier works on input interpretation from both multimodal
interaction systems, e.g., [Bolt, 1980; Burger and Marshall, 1993; Cohen et al., 1997; Zancanaro et al., 1997; Wahlster, 1998; Johnston and Bangalore, 2000] and spoken dialog systems [Allen et al., 2001; Wahlster, 2000]. Specifically, MIND presents two unique features. First, MIND exploits a fine-grained semantic model that characterizes the meanings of user inputs and the overall conversation. Second, MIND employs an integrated interpretation approach that uses a wide variety of contexts (e.g., conversation history and domain knowledge). These two features enable MIND to enhance understanding of user inputs, including those ambiguous and incomplete inputs.
2. Related Work
Since the first appearance of the "Put-That-There" system [Bolt, 1980], a variety of multimodal systems have emerged, from earlier versions that combined speech, mouse pointing [Neal and Shapiro, 1988], and gaze [Koons et al., 1993], to systems that integrate speech with pen-based gestures, e.g., hand-drawn graphics [Cohen et al., 1997; Wahlster, 1998]. There are also more sophisticated systems that combine multimodal inputs and outputs [Cassell et al., 1999], and systems that work in a mobile environment [Johnston et al., 2002; Oviatt, 2000]. Recently, we have seen a new generation of systems that not only support multimodal user inputs, but can also engage users in an intelligent conversation [Alexandersson and Becker, 2001; Gustafson et al., 2000; Johnston et al., 2002]. To function effectively, each of these systems must be able to adequately interpret multimodal user inputs. Substantial work on multimodal interpretation has focused on semantic fusion [Johnston, 1998; Johnston and Bangalore, 2000; Vo and Wood, 1996; Wu et al., 1999]. In contrast, this chapter describes a framework that combines semantic fusion with context-based inference for multimodal interpretation.
3. MIND Overview
To interpret multimodal user inputs, MIND takes three major steps as shown in Figure 12.2: unimodal understanding, multimodal understanding, and discourse understanding. During unimodal understanding, MIND applies modality specific recognition and understanding components (e.g., a speech recognizer and a language interpreter) to identify meanings of each unimodal input. During multimodal understanding, MIND combines semantic meanings of unimodal inputs and uses contexts (e.g., conversation context and domain context) to form an overall understanding of multimodal user inputs. Furthermore, MIND also identifies how an input relates to the overall conversation discourse through discourse understanding. In particular, MIND groups together inputs that contribute to the same goal/sub-goal [Grosz and Sidner, 1986]. The result of discourse understand-
ing is an evolving conversation history that reflects the overall progress of a conversation.

Figure 12.2. MIND overview.
4. Example Scenario
Figure 12.3 shows a conversation fragment between a user and RIA. The user initiates the conversation by asking for houses in Irvington (U1), and RIA replies by showing a group of desired houses (R1). Based on the generated visual display, the user points to the screen (a position between two houses as in Figure 12.4) and asks for the price (U2). In this case, it is not clear which object the user is pointing at. There are three candidates: two houses nearby and the town of Irvington. Using our domain knowledge, MIND can rule out the town of Irvington, since he is asking for a price. At this point, MIND still cannot determine which of the two house candidates is intended. To clarify this ambiguity, RIA highlights both houses and asks the user to pinpoint the house of his/her interest (R2). Again, the user’s reply (U3) alone would be ambiguous, since there are multiple green objects on the screen. However, using the conversation history and the visual properties, MIND is able to infer that s/he is referring to the highlighted green house. The user continues on to ask for the size (U4). This request by itself is incomplete, since s/he did not explicitly specify the object of his/her interest (house). Nevertheless, MIND understands that the user is asking for the size of the same green house based on the conversation context.
User (U1): Speech: Show me houses here. Gesture: point to Irvington on the screen.
RIA (R1): Speech: Here are the houses you requested. Graphics: show a set of house icons on the screen.
User (U2): Speech: How much is this? Gesture: point to the screen (not directly on any object).
RIA (R2): Speech: Which house are you interested in? Graphics: highlight both house candidates.
User (U3): Speech: The green one.
RIA (R3): Speech: The green house costs 250,000 dollars. Graphics: highlight the green house.
User (U4): Speech: How large?
RIA (R4): Speech: The size of this house is 2500 square feet.
User (U5): Speech: What about this one? Gesture: point to another house.
RIA (R5): Speech: The size of this house is 2200 square feet. Graphics: highlight another house.
User (U6): Speech: Show me houses with this style around here. Gesture: point to east of Irvington on the map.
RIA (R6): Speech: I found seven Victorian style houses in White Plains. Graphics: show seven house icons in White Plains.
Figure 12.3. Example scenario.
The user moves on to inquire about another house (U5). This input by itself does not indicate exactly what the user wants. Again, using the conversation context, MIND recognizes that s/he is most likely asking for the size of another house. Finally, in U6, the user asks for houses with a certain style in a certain location. Although there is only one deictic gesture, based on the visual context, MIND understands that "this style" refers to the style of the highlighted house on the screen and "here" refers to the location of White Plains, which is east of Irvington. As mentioned above, our example shows that multimodal user inputs exhibit a wide range of variation. The inputs may be abbreviated, ambiguous or complex, and simply fusing inputs together often cannot reach a full understanding. For example, the deictic gesture in U2 is ambiguous: it is not clear which object the user is pointing at, the two houses nearby or the town of Irvington. The user input U5 by itself is incomplete, since the purpose of the input is unspecified. Furthermore, in U6, a single deictic gesture overlaps (in terms of time) with both "this style" and "here" from the speech input; it is hard to determine which one of those two references should be aligned and fused with the gesture. Processing these inputs requires context to be taken into consideration. Therefore, we have designed and implemented a context-based interpretation approach in MIND. Currently, MIND uses three types of contexts: domain context, conversation context, and visual context.
Figure 12.4. Imprecise pointing on the graphic display: the user points to a position between two houses.
The domain context provides domain knowledge. The conversation context reflects the progress of the overall conversation. The visual context gives the detailed semantic and syntactic structures of visual objects and their relations. In the next few sections, we first describe our semantics-based representation, and then present the context-based approach using this representation.
5. Semantics-Based Representation
To support context-based multimodal interpretation, we need to represent both user inputs and various contextual information. In this chapter, we focus on describing the representations of user inputs and the conversation context. In particular, we discuss two aspects of the representation: semantic models that capture salient information and structures that represent these semantic models.
5.1 Semantic Modelling of User Inputs
To model user inputs, MIND has two goals. First, MIND must understand the meanings of user inputs so that the conversation facilitator (Figure 12.1) can decide how the system should act. Second, MIND should capture the user input styles (e.g., using a particular verbal expression or gesturing in a particular way) or user communicative preferences (e.g., preferring a verbal vs. a visual presentation) so that such information can help the multimedia generation components (visual or language designers in Figure 12.1) to create more effective and tailored system responses. To accomplish both goals, MIND characterizes four aspects of a user input: intention, attention, presentation preference, and interpretation status.
5.1.1 Intention. Intention describes the purpose of a user input [Grosz and Sidner, 1986]. We further characterize three aspects of an intention: Motivator, Act, and Method. Motivator captures the purpose of an interaction. Since we focus on information-seeking applications, MIND currently distinguishes three top-level purposes: DataPresentation, DataAnalysis (e.g., comparison), and ExceptionHandling (e.g., disambiguation). Act indicates one of three user actions: request, reply, and inform. Request specifies that a user is making an information request (e.g., asking for a collection of houses in U1). Reply indicates that the user is responding to a previous RIA request (e.g., confirming the house of interest in U3). Unlike Request or Reply, Inform states that a user is simply providing RIA with specific information, such as personal profiles or interests. For example, during a house exploration, a user may tell RIA that she has school-age children. Method further refines a user action. For example, MIND may distinguish two different types of requests: one user may request RIA to Describe the desired information, such as the price of a house, while another may request RIA simply to Identify the desired information (e.g., show a train station on the screen). In some cases, Motivator, Act and Method can be directly captured from individual inputs (e.g. U1). However, in other situations, they can only be inferred from the conversation discourse. For example, from U3 itself, MIND only understands that the user is referring to a green object. It is not clear whether this is a reply or an inform. Moreover, the overall purpose of this input is also unknown. Nevertheless, based on the conversation context, MIND understands that this input is a reply to a previous RIA question (Act: Reply), and contributes to the overall purpose of an exception-handling intent (Motivator: ExceptionHandling). In addition to recording the purpose of each user input, Motivator also captures discourse purposes (described later). Therefore, Motivator can also be viewed as characterizing sub-dialogues discussed in the previous literature [Lambert and Carberry, 1992; Litman and Allen, 1987]. For example, ExceptionHandling (with Method: Correct) corresponds to a Correction sub-dialogue. However, unlike earlier works, our Motivator is used to model intentions at both input (turn) and discourse levels. Finally, we model intention not only to support the understanding of a conversation, but also to facilitate the multimedia generation. Specifically, Motivator and Method together direct RIA in its response generation. For example, RIA would consider Describe and Identify two different data presentation directives [Zhou and Pan, 2001]. Figure 12.5(a) shows the Intention that MIND has identified from the user input U2 (Figure 12.3). It says that the user is asking RIA to present him with desired information, which is captured in Attention below.
Figure 12.5. Semantic modelling for user inputs.
(a) Intention: Motivator: DataPresentation; Act: Request; Method: Describe
(b) Attention: Base: House; Topic: Instance; Focus: SpecificAspect; Aspect: Price; Content: <MLS0187652 | MLS0889234>
(c) Presentation Preference: Directive: Summary; Media: Multimedia; Device: Desktop; Style: < >
(d) Interpretation Status: SyntacticCompleteness: ContentAmbiguity; SemanticCompleteness: Yes
5.1.2 Attention. While Intention indicates the purpose of a user input, Attention captures the content of a user input with six dimensions. Base specifies the semantic category of the content (e.g., all houses in our application belong to the House category). Topic indicates whether the user is concerned with a concept, a relation, an instance, or a collection of instances. For example, in U1 (Figure 12.3) the user is interested in a collection of House, while in U2 he is interested in a specific instance. Focus further narrows down the scope of the content to distinguish whether the user is interested in a topic as a whole or just the specific aspects of the topic. For example, in U2 the user focuses only on one specific aspect (price) of a house instance. Aspect enumerates the actual topical features that the user is interested in (e.g., the price in U2). Constraint holds the user constraints or preferences placed on the topic. For example, in U1 the user is only interested in the house instances (Topic) located in Irvington (Constraint). The last parameter Content points to the actual data in our database. Figure 12.5(b) records the Attention identified by MIND for the user input U2. It states that the user is interested in the price of a house instance, MLS0187652 or MLS0889234 (house ids from the Multiple Listing Service). As discussed later, our fine-grained modelling of Attention provides MIND the ability to discern subtle changes in user interaction (e.g., a user may focus on one topic but explore different aspects of the topic). This in turn helps MIND assess the overall progress of a conversation. 5.1.3 Presentation preference. During a human-computer interaction, a user may indicate what type of responses she prefers. Currently, MIND captures user preferences along four dimensions. Directive specifies the highlevel presentation goal (e.g., preferring a summary to details). Media indicates the preferred presentation medium (e.g., verbal vs. visual). Style describes what general formats should be used (e.g., using a chart vs. a diagram to illustrate information). Device states what devices would be used in the presentation (e.g., phone or PDA). Using the captured presentation preferences, RIA can generate multimedia presentations that are tailored to individual users and their goals. For example, Figure 12.5(c) records the user preferences from U2.
Since the user did not explicitly specify any preferences, MIND uses the default values to represent those preferences. Presentation preferences can either be directly derived from user inputs or inferred based on user and environment contexts.
5.1.4 Interpretation status. Interpretation status provides an overall assessment of how well MIND understands an input. This information is particularly helpful in guiding RIA's next move. Currently, it includes two features. SyntacticCompleteness assesses whether there is any unknown or ambiguous information in the interpretation result. SemanticCompleteness indicates whether the interpretation result makes sense. Using the status, MIND can inform other RIA components whether a certain exception has arisen. For example, the value ContentAmbiguity in SyntacticCompleteness (Figure 12.5d) indicates that there is an ambiguity concerning Content in Attention, since MIND cannot determine whether the user is interested in MLS0187652 or MLS0889234. Based on this status, RIA would ask a clarification question to disambiguate the two houses (e.g., R2 in Figure 12.3).
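To illustrate how the four aspects described in this section might be encoded, the sketch below uses Python dataclasses. It is a simplified rendering for expository purposes, not the data structures used in MIND; the field names follow the terms introduced above, and the enumerated values are limited to those mentioned in the text.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Intention:
        motivator: str = "Unknown"    # DataPresentation, DataAnalysis, ExceptionHandling
        act: Optional[str] = None     # Request, Reply, Inform
        method: Optional[str] = None  # Describe, Identify, ...

    @dataclass
    class Attention:
        base: str = "Unknown"         # semantic category, e.g. House or City
        topic: Optional[str] = None   # Concept, Relation, Instance, Collection
        focus: Optional[str] = None   # whole topic or SpecificAspect
        aspect: List[str] = field(default_factory=list)      # e.g. ["Price"]
        constraint: List[str] = field(default_factory=list)  # e.g. ["City = Irvington"]
        content: List[str] = field(default_factory=list)     # database ids, e.g. MLS numbers

    @dataclass
    class PresentationPreference:
        directive: str = "Summary"
        media: str = "Multimedia"
        style: Optional[str] = None
        device: str = "Desktop"

    @dataclass
    class InterpretationStatus:
        syntactic_completeness: str = "Yes"   # or e.g. "ContentAmbiguity"
        semantic_completeness: str = "Yes"

    @dataclass
    class UserInput:
        intention: Intention
        attention: List[Attention]
        preference: PresentationPreference
        status: InterpretationStatus

    # The interpretation of U2 after fusion (cf. Figure 12.5) could then look like:
    u2 = UserInput(
        intention=Intention("DataPresentation", "Request", "Describe"),
        attention=[Attention("House", "Instance", "SpecificAspect", ["Price"],
                             [], ["MLS0187652", "MLS0889234"])],
        preference=PresentationPreference(),
        status=InterpretationStatus(syntactic_completeness="ContentAmbiguity"),
    )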
5.2 Semantic Modelling of Conversation Discourse
In addition to modelling the meanings of user inputs at each conversation turn, we also model the overall progress of a conversation. Based on Grosz and Sidner’s conversation theory [Grosz and Sidner, 1986], MIND establishes a refined discourse structure as conversation proceeds. This is different from other multimodal systems that maintain the conversation history by using a global focus space [Neal et al., 1998], segmenting a focus space based on intention [Burger and Marshall, 1993], or establishing a single dialogue stack to keep track of open discourse segments [Stent et al., 1999].
5.2.1 Conversation unit and segment. Our discourse structure has two main elements: conversation units and conversation segments. A conversation unit records user or RIA actions at a single turn of a conversation. These units can be grouped together to form a segment (e.g., based on their intentional similarities). Moreover, different segments can be organized into a hierarchy (e.g., based on intentions and sub-intentions). Figure 12.6 depicts the discourse structure of the conversation after interpreting U3 in Figure 12.3. This structure contains five conversation units (rectangles U1–3 for the user, R1–2 for RIA) and three conversation segments (ovals DS1–3). A user conversation unit contains the interpretation result of a user input discussed in the last section (as shown in U1). A RIA unit contains the automatically generated multimedia response, including the semantic and syntactic structures of a multimedia presentation [Zhou and Pan, 2001].
Figure 12.6. Conversation discourse.

A conversation
segment has five features: Initiator, Addressee, State, Intention, and Attention. The Intention and Attention are similar to those modelled in the units (see DS1 and U1). Initiator indicates the conversation initiating participant (e.g., Initiator is User in DS1), while Addressee indicates the recipient of the conversation (e.g., Addressee is RIA in DS1). Currently, we are focused on one-to-one conversations. However, MIND can be extended to support multi-party conversations where Addressee could be a group of agents. Finally, State reflects the current state of a segment: active, accomplished or suspended. For example, after interpreting U3, DS2 is still active (before R2 is generated), but DS3 is already accomplished, since its purpose of disambiguating the content has been fulfilled.
5.2.2 Discourse relations. To model the progress in a conversation, MIND captures three types of discourse relations: conversation structural relations, conversation transitional relations, and data transitional relations. Conversation structural relations reveal the intentional structure between the purposes of conversation segments. Following Grosz and Sidner’s early work, there are currently two types: dominance and satisfaction-precedence. For example, in Figure 12.6, DS2 dominates DS3, since the segment of ExceptionHandling (DS3) is for the purpose of DataPresentation (DS2).
Conversation transitional relations specify transitions between conversation segments and between conversation units as the conversation unfolds. Currently, two types of relations are identified between segments: intention switch and attention switch. The intention switch relates two segments that differ in their intentions. Interruption is a sub-type of an intention switch. The attention switch relates two segments that possess the same intention but differ in their attention. For instance, in Figure 12.6, there is an attention switch from DS1 to DS2 since DS1 concerns a collection of houses and DS2 focuses on one particular house. Furthermore, there are also temporal-precedence relations that connect different segments together based on the order in which they occur. The temporal-precedence relation also connects conversation units to preserve the sequence of the conversation. Data transitional relations further discern different types of attention switches. In particular, we distinguish eight types of attention switches, including Collection-to-Instance and Instance-to-Aspect. For example, the attention is switched from a collection of houses in DS1 to a specific house in DS2 (Figure 12.6). Data transitional relations allow MIND to capture user data exploration patterns. Such patterns in turn can help RIA decide potential data navigation paths and provide users with an efficient information-seeking environment. Our studies showed that, in an information-seeking environment, the conversation flow usually centres around the data transitional relations. This is different from task-oriented applications, where dominance and satisfaction-precedence are commonly observed. In an information-seeking application, the communication is more focused on the type and the actual content of information, which often by itself does not impose any dominance or precedence relations.
5.3 Representing Intention and Attention in Feature-Based Structures
Based on the semantic models of intention and attention described earlier, MIND uses feature-based structures to represent intention and attention. The type of a structure is captured by a special feature FsType. In an Intention structure, FsType takes the value of Motivator (i.e., the purpose of communication), and in an Attention structure, FsType takes the value of Base (i.e., the semantic category of the content). Since the values of Motivator or Base may not be able to be inferred from current user inputs directly, we have added a value Unknown. In addition to other characteristics of using feature-based structures [Carpenter, 1992], our representation has two unique features. The first characteristic is that intention and attention are consistently represented at different processing stages. More specifically, MIND uses the same
set of features to represent intention and attention identified from unimodal user inputs, combined multimodal inputs, and conversation segments. Figure 12.7(a) outlines the intention and attention identified by MIND for the speech input in U2. Since the semantic type of the content is unknown, the type of the Attention is set to unknown (FsType: Unknown). Note that here we only include the features that can be instantiated. For example feature Content is not present in the Attention structure, since the exact object of the instance is not specified in the speech input.
Figure 12.7. Feature structures for intention and attention.
(a) U2 speech: "How much is this"
    Intention: FsType: DataPresentation; Act: Request; Method: Describe
    Attention: FsType: Unknown; Topic: Instance; Focus: SpecificAspect; Aspect: Price
(b) U2 gesture: pointing
    Intention: FsType: Unknown
    Attention: FsType: House; Topic: Instance; Content: {MLS0187652}
    Attention: FsType: House; Topic: Instance; Content: {MLS0889234}
    Attention: FsType: City; Topic: Instance; Content: {"Irvington"}
(c) U2 as a result of multimodal fusion
    Intention: FsType: DataPresentation; Act: Request; Method: Describe
    Attention (A1): FsType: House; Topic: Instance; Focus: SpecificAspect; Aspect: Price; Content: {MLS0187652}
    Attention (A2): FsType: House; Topic: Instance; Focus: SpecificAspect; Aspect: Price; Content: {MLS0889234}
    Attention (A3): FsType: City; Topic: Instance; Focus: SpecificAspect; Aspect: Price; Content: {"Irvington"}
Similarly, we represent the semantic information extracted from a deictic gesture. Figure 12.7(b) shows the corresponding feature structures for U2 gesture input. The Intention structure has an Unknown type since the high level purpose and the specific task cannot be identified from the gesture input. Furthermore, because of the ambiguity of the deictic gesture, three Attention structures are identified. The first two are for house instances MLS0187652 and MLS0889234, and the third is for the town of Irvington. MIND performs multimodal fusion by combining each modality’s feature structures into a single feature structure. The result of multimodal fusion is shown in Figure 12.7(c). Furthermore, MIND uses the same kind of feature structures to represent intention and attention in conversation segments. Figure 12.8 records the conversation segments that cover U2 through R3. As described later, such a consistent representation facilitates a context-driven inference. The second characteristic of our representation is its flexible composition. One feature structure can be nested in another feature structure. For exam-
Figure 12.8. Feature structures for intention and attention in conversation segments.
For example, U6 in Figure 12.3 is a complex input, where the speech input "show me houses with this style around here" contains two references. The Attention structure created for the U6 speech input is shown in Figure 12.9. The structure A1 indicates that the user is interested in a collection of houses that satisfy two constraints. The first constraint is about the style (Aspect: Style), and the second is about the location. Both of these constraints are related to other objects, which are represented by similar Attention structures A2 and A3 respectively. During the interpretation process, MIND first tries to resolve these two references and then reformulates the overall constraints [Chai, 2002b].
6. Context-Based Multimodal Interpretation
Based on the semantic representations described above, MIND uses a wide variety of contexts to interpret the rich semantics of user inputs. Currently, we support three input modalities: speech, text, and gesture. Specifically, we use IBM ViaVoice to perform speech recognition, and a statistics-based natural language understanding component [Jelinek et al., 1994] to process natural language sentences. For gestures, we have developed a simple geometry-based gesture recognition and understanding component.
Figure 12.9. Attention structures for U6 speech input.
Based on the output from unimodal understanding, MIND performs multimodal understanding, consisting of two sub-processes: multimodal fusion and context-based inference.
6.1 Multimodal Fusion
During multimodal fusion, MIND first uses temporal constraints to align Intention/Attention structures identified from each modality, and then to unify the corresponding structures. The formally defined unification operation provides a mechanism to combine information from a number of distinct sources into a unique representation [Carpenter, 1992]. Two feature structures can be unified if they have compatible types based on an inheritance hierarchy (an Unknown type always subsumes other known types), and the values of the same features are also consistent with each other (i.e., satisfying an inheritance hierarchy). Otherwise, unification fails. Based on this nature, unification is applied in multimodal fusion since information from different modalities tends to complement each other [Johnston, 1998]. MIND also applies unification to multimodal fusion. For example, in Figure 12.7(b), there are three Attention structures for U2 gesture input due to the imprecise pointing. Each of them can be unified with the Attention structure from U2 speech input (in Figure 12.7a). The fusion result is shown in Figure 12.7(c). In this case, an ambiguity arises due to the post-fusion presence of three Attention structures. In many other cases, the overall meanings of a user input still cannot be identified as a result of multimodal fusion. For example, the exact focus of attention for U4 is still unclear after the fusion. To enhance interpretation, MIND uses various contexts to make inferences about the inputs.
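The sketch below illustrates unification of two feature structures represented as plain Python dictionaries. It is a simplified stand-in for the formal operation (there is no type inheritance hierarchy beyond the Unknown convention described above), intended only to show how the complementary speech and gesture readings of U2 combine; it is not MIND's implementation.

    def compatible_types(a, b):
        # Unknown subsumes any known type; otherwise types must match exactly
        # (a real implementation would consult an inheritance hierarchy).
        return a == "Unknown" or b == "Unknown" or a == b

    def unify(fs1, fs2):
        # Returns the unified feature structure, or None if unification fails.
        if not compatible_types(fs1.get("FsType", "Unknown"), fs2.get("FsType", "Unknown")):
            return None
        result = {}
        for key in set(fs1) | set(fs2):
            if key in fs1 and key in fs2:
                v1, v2 = fs1[key], fs2[key]
                if isinstance(v1, dict) and isinstance(v2, dict):
                    sub = unify(v1, v2)          # recurse into nested structures
                    if sub is None:
                        return None
                    result[key] = sub
                elif v1 == "Unknown":
                    result[key] = v2
                elif v2 == "Unknown" or v1 == v2:
                    result[key] = v1
                else:
                    return None                  # conflicting values: unification fails
            else:
                result[key] = fs1.get(key, fs2.get(key))
        return result

    # U2: the speech reading supplies Focus and Aspect, the gesture reading a candidate house.
    speech = {"FsType": "Unknown", "Topic": "Instance", "Focus": "SpecificAspect", "Aspect": "Price"}
    gesture = {"FsType": "House", "Topic": "Instance", "Content": "MLS0187652"}
    print(unify(speech, gesture))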
Covering feature structure A onto feature structure B creates feature structure C in the following steps:

(1) if (the type of A and the type of B are not compatible)
        then covering fails
    else if the type of A is Unknown
        assign the type of B to the type of C
    else
        assign the type of A to the type of C
(2) for (each feature fi in both A and B) {
        Suppose ui is the value in A, vi is the value in B, and wi is the value in C
        (a) if ui and vi are feature structures
                then covering ui onto vi and assign the result to wi   // recursively applying covering
        (b) else assign ui to wi   // the value in the covering structure prevails
    }
(3) for (each feature in A but not in B) add this feature and its value in C
(4) for (each feature in B but not in A) add this feature and its value in C

Figure 12.10.   Covering operation.

6.2 Context-Based Inference
Currently, MIND uses three types of context: conversation context, domain context, and visual context.
6.2.1 Conversation context. Conversation context provides an overall history of a conversation as described earlier. In an information-seeking environment, users tend to only explicitly or implicitly specify the new or changed aspects of the information of their interest without repeating what has been mentioned earlier in the conversation. Given a partial user input, required but unspecified information needs to be inferred from the conversation context. Currently, MIND applies an operation, called covering, to draw inferences from the conversation context [Chai, 2002a]. Although the mechanism of our covering operation is similar to the overlay operation described in [Alexandersson and Becker, 2001], not only can our covering infer the focus of attention (as overlay does), but it can also infer the intention. What makes this operation possible is our underlying consistent representation of intention and attention at both the discourse and the input levels. Covering combines two feature structures by placing one structure (covering structure) over the other one (covered structure). Figure 12.10 describes the steps of the covering operation. Specifically, if the types of two structures are not compatible, then the covering fails (step 1). For the same features in both structures, the values from the covering structure always prevail and are included in the resulting structure (step 2b). Covering is recursive (step
Figure 12.11.   Example of context-based inference using covering.
2a). Features that exist only in one structure but not in the other are included automatically in the resulting structure (steps 3 and 4). For example, to interpret U4, MIND applies the operation by covering U4 on DS2. As a result, features in DS2 (Topic and Content) are added to the combined structure and the value of the Aspect feature is changed to Size (Figure 12.11). Thus, MIND is able to figure out that the user is interested in the size of the same house as in U2. Note that it is important to maintain a hierarchical conversation history based on goals and sub-goals. Without such a hierarchical structure, MIND would not be able to infer the content of U4. Furthermore, because of the consistent representation of intention and attention at both the discourse level (in conversation segments) and the input level (in conversation units), MIND is able to directly use the conversation context to infer unspecified information. The covering operation can also be applied to U5 and the discourse segment built after R4 is processed. Therefore, MIND can infer that the user is asking for the size of another house in U5.
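Under the same simplified encoding used in the unification sketch above, the steps of Figure 12.10 translate almost line by line into code. The following sketch is hedged in the same way: the compatibility test is deliberately simplified and the U4/DS2 structures are reconstructed from the running example, so it is an illustration rather than the actual MIND implementation.

```python
# A sketch of the covering operation of Figure 12.10, using the simple
# {'type': ..., 'feats': {...}} encoding from the unification sketch above.

def compatible(t1, t2):
    # simplified: identical types, or one of them is Unknown
    return t1 == t2 or "Unknown" in (t1, t2)

def covering(a, b):
    """Cover structure a (covering) onto structure b (covered); None on failure."""
    # Step 1: type check; the covering structure's type prevails unless Unknown.
    if not compatible(a["type"], b["type"]):
        return None
    c_type = b["type"] if a["type"] == "Unknown" else a["type"]
    feats = {}
    # Step 2: shared features -- values from the covering structure prevail,
    # recursing when both values are themselves feature structures.
    for f in a["feats"].keys() & b["feats"].keys():
        u, v = a["feats"][f], b["feats"][f]
        if isinstance(u, dict) and isinstance(v, dict):
            w = covering(u, v)
            if w is None:
                return None
            feats[f] = w
        else:
            feats[f] = u
    # Steps 3 and 4: features present in only one structure are carried over.
    for f in a["feats"].keys() - b["feats"].keys():
        feats[f] = a["feats"][f]
    for f in b["feats"].keys() - a["feats"].keys():
        feats[f] = b["feats"][f]
    return {"type": c_type, "feats": feats}

# Covering U4 ("what about the size?") onto discourse segment DS2:
u4  = {"type": "House", "feats": {"Aspect": "Size"}}
ds2 = {"type": "House", "feats": {"Topic": "Instance", "Aspect": "Price",
                                  "Content": "MLS0187652"}}
print(covering(u4, ds2))
# Aspect comes from U4 (Size); Topic and Content are inherited from DS2.
```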
6.2.2 Domain context. Domain context provides the domain knowledge such as semantic and meta information about the application data. The domain context is particularly useful in resolving input ambiguities. For example, to resolve the ambiguity of whether the attention is a city object or a house object (as in U2), MIND uses the domain context. In this case, A3 in Figure 12.7(c) indicates the interest is the price of the city Irvington. Based on the domain knowledge that the city object does not have a price attribute, A3 is filtered out. As a result, both A1 and A2 are potential interpretations, and
RIA is able to pose a follow-up question to further disambiguate the two houses (R2 in Figure 12.3).
6.2.3 Visual context. As RIA provides a rich visual environment for users to interact with, users may refer to objects on the screen by their spatial (e.g., the house at the left corner) or perceptual attributes (e.g., the red house). To resolve these spatial/perceptual references, MIND exploits the visual context, which logs the detailed semantic and syntactic structures of visual objects and their relations. More specifically, the automatically generated visual encoding for each object is maintained as part of the system conversation unit in our conversation history. During reference resolution, MIND identifies potential candidates by mapping the referring expressions onto this internal visual representation. For example, the object which is highlighted on the screen (R5) has an internal representation that associates the visual property Highlight with an object identifier. This allows MIND to correctly resolve the referent of “this style” in U6. The representation for the U6 speech input in Figure 12.9 indicates three attention structures. The gesture input overlaps with both “this style” (corresponding to A2) and “here” (corresponding to A3); there is no obvious temporal relation indicating which of these two references should be unified with the deictic gesture. In fact, both A2 and A3 are potential candidates. An earlier study [Kehler, 2000] indicates that objects in the visual focus are often referred to by pronouns or demonstratives, rather than by full noun phrases or deictic gestures. Based on this finding and using the visual context, MIND infers that “this style” most likely refers to the style of the highlighted house and that the deictic gesture resolves the referent of “here”. Supposing that the style is Victorian and that “here” refers to White Plains, MIND is then able to reformulate the constraints and figure out the overall meaning of U6: looking for houses with a Victorian style and located in White Plains.

During the context-based inference, MIND applies an instance-based approach to first determine whether there is enough information collected from the current user inputs. We have collected a set of instances, each of which is a pair of valid intention and attention structures that MIND can handle. By comparing the fused representation (the result of the multimodal fusion process) with the set of instances, MIND determines whether the information is adequate for further reasoning. If the information is sufficient, MIND then uses the domain context and visual context to further resolve ambiguities and validate the fused representation. If the information is inadequate, MIND applies the covering operation on conversation segments (starting from the most recent one) to infer the unspecified information from the conversation context.
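The overall control flow of this last step can be summarized in a few lines. The sketch below is only a schematic rendering of the description above; all helper functions are placeholders whose names and signatures are our own assumptions, not part of the MIND system.

```python
# A schematic sketch of the context-based inference control flow just described.
# matches_instance, resolve_with_context and covering are placeholders standing
# in for the components discussed in the text; names and signatures are assumed.

def interpret(fused, valid_instances, conversation_segments,
              matches_instance, resolve_with_context, covering):
    """Return an interpretation of the fused input, or None if nothing applies."""
    # 1. Instance-based check: is the fused representation already adequate?
    if any(matches_instance(fused, inst) for inst in valid_instances):
        # 2. Adequate: use domain and visual context to resolve ambiguities
        #    and validate the fused representation.
        return resolve_with_context(fused)
    # 3. Inadequate: infer unspecified information from the conversation
    #    context by covering the input onto segments, most recent first.
    for segment in reversed(conversation_segments):
        completed = covering(fused, segment)
        if completed is not None and any(matches_instance(completed, inst)
                                         for inst in valid_instances):
            return resolve_with_context(completed)
    return None  # e.g. hand over to exception handling / a clarification question
```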
7. Discussion
In a conversation setting, user inputs could be ambiguous, abbreviated, or complex. Simply fusing multimodal inputs together may not be sufficient to derive a complete understanding of the inputs. Therefore, we have designed and implemented a context-based framework, MIND, for multimodal interpretation. MIND relies on a fine-grained semantic representation that captures salient information from user inputs and the overall conversation. The consistent representation of intention and attention provides a basis for MIND to unify the multimodal fusion and context-based inference processes.

Our ongoing work is focused on the following aspects. First of all, we continue to improve multimodal interpretation by incorporating a probabilistic model [Chai et al., 2004]. Our existing approach of fusion and inference is straightforward for simple user inputs. However, it becomes complicated when multiple attentions from one input need to be unified with multiple attentions from another input. Suppose that the user says “tell me more about the red house, this house, the blue house,” and at the same time she points to two positions on the screen sequentially. This alignment can be easily performed when there is a clear temporal binding between a gesture and a particular phrase in the speech. However, in a situation where a gesture is followed (or preceded) by a phrase without an obvious temporal association, as in “tell me more about the red house (deictic gesture 1) this house (deictic gesture 2) the blue house,” a simple unification cannot solve the problem.

Furthermore, we are investigating the use of other contexts, such as user context. The user context provides MIND with user profiles. A user profile is established through two means: explicit specification and automated learning. Using a registration process, information about user preferences can be gathered, such as whether the school district is important. In addition, MIND can also learn user vocabularies and preferences from real sessions. One attempt is to use this context to map fuzzy terms in an input to precise query constraints. For example, the interpretation of terms such as “expensive” or “big” may vary greatly from one user to another. Based on different user profiles, MIND can interpret these fuzzy terms as different query constraints.

The third aspect is improving discourse interpretation. Discourse interpretation identifies the contribution of user inputs toward the overall goal of a conversation. During the discourse interpretation, MIND decides whether the input at the current turn contributes to an existing segment or starts a new one. In the latter case, MIND must decide where to add the new segment and how this segment relates to existing segments in the conversation history.
Acknowledgements We would like to thank Keith Houck for his contributions to training models for speech/gesture recognition and natural language parsing, and Rosario Uceda-Sosa for her work on the RIA information server.
References Alexandersson, J. and Becker, T. (2001). Overlay as the Basic Operation for Discourse Processing in a Multimodal Dialogue System. In Proceedings of the IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, Washington, USA. Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., and Stent, A. (2001). Towards Conversational Human-Computer Interaction. AI Magazine, 22(4):27–37. Bolt, R. A. (1980). Voice and Gesture at the Graphics Interface. Computer Graphics, pages 262–270. Burger, J. and Marshall, R. (1993). The Application of Natural Language Models to Intelligent Multimedia. In Maybury, M., editor, Intelligent Multimedia Interfaces, pages 429–440. Menlo Park, CA: MIT Press. Carpenter, B. (1992). The Logic of Typed Feature Structures. Cambridge University Press. Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhjálmsson, H., and Yan, H. (1999). Embodiment in Conversational Interfaces: Rea. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 520–527, Pittsburgh, PA. Chai, J. (2002a). Operations for Context-Based Multimodal Interpretation in Conversational Systems. In Proceedings of International Conference on Spoken Language Processing (ICSLP), pages 2249–2252, Denver, Colorado, USA. Chai, J. (2002b). Semantics-Based Representation for Multimodal Interpretation in Conversational Systems. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), pages 141–147, Taipei, Taiwan. Chai, J., Hong, P., and Zhou, M. (2004). A Probabilistic Approach for Reference Resolution in Multimodal User Interfaces. In Proceedings of the International Conference on Intelligent User Interfaces (IUI), pages 70–77, Madeira, Portugal. ACM. Cohen, P., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L., and Clow, J. (1997). Quickset: Multimodal Interaction for Distributed Applications. In Proceedings of the Fifth Annual International ACM Multimedia Conference, pages 31–40, Seattle, USA.
Grosz, B. J. and Sidner, S. (1986). Attention, Intentions, and the Structure of Discourse. Computational Linguistics, 12(3):175–204. Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D., and Wirén, M. (2000). AdApt—A Multimodal Conversational Dialogue System in an Apartment Domain. In Proceedings of International Conference on Spoken Language Processing (ICSLP), volume 2, pages 134–137, Beijing, China. Jelinek, F., Lafferty, J., Magerman, D. M., Mercer, R., and Roukos, S. (1994). Decision Tree Parsing Using a Hidden Derivation Model. In Proceedings of Darpa Speech and Natural Language Workshop, pages 272–277. Johnston, M. (1998). Unification-Based Multimodal Parsing. In Proceedings of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), pages 624–630, Montreal, Quebec, Canada. Johnston, M. and Bangalore, S. (2000). Finite-State Multimodal Parsing and Understanding. In Proceedings of the 18th International Conference on Computational Linguistics (COLING), pages 369–375, Saarbrücken, Germany. Johnston, M., Bangalore, S., Visireddy, G., Stent, A., Ehlen, P., Walker, M., Whittaker, S., and Maloor, P. (2002). MATCH: An Architecture for Multimodal Dialog Systems. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 376–383, Philadelphia, USA. Kehler, A. (2000). Cognitive Status and Form of Reference in Multimodal Human-Computer Interaction. In Proceedings of the 17th National Conference on Artifical Intelligence (AAAI), pages 685–689, Austin, Texas, USA. Koons, D. B., Sparrell, C. J., and Thorisson, K. R. (1993). Integrating Simultaneous Input from Speech, Gaze, and Hand Gestures. In Maybury, M., editor, Intelligent Multimedia Interfaces, pages 257–276. Menlo Park, CA: MIT Press. Lambert, L. and Carberry, S. (1992). Modeling Negotiation Subdialogues. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL), pages 193–200, Newark, Delaware, USA. Litman, D. J. and Allen, J. F. (1987). A Plan Recognition Model for Subdialogues in Conversations. Cognitive Science, 11:163–200. Neal, J. G. and Shapiro, S. C. (1988). Architectures for Intelligent Interfaces: Elements and Prototypes. In Sullivan, J. and Tyler, S., editors, Intelligent User Interfaces, pages 69–91. Addison-Wesley. Neal, J. G., Thielman, C. Y., Dobes, Z., Haller, S. M., and Shapiro, S. C. (1998). Natural Language with Integrated Deictic and Graphic Gestures. In Maybury, M. and Wahlster, W., editors, Intelligent User Interfaces, pages 38–52. Morgan Kaufmann.
Oviatt, S. L. (2000). Multimodal System Processing in Mobile Environments. In Proceedings of the Thirteenth Annual ACM Symposium on User Interface Software Technology (UIST), pages 21–30. New York: ACM Press. Stent, A., Dowding, J., Gawron, J. M., Bratt, E. O., and Moore, R. (1999). The Commandtalk Spoken Dialog System. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pages 183–190, College Park, Maryland, USA. Vo, M. T. and Wood, C. (1996). Building an Application Framework for Speech and Pen Input Integration in Multimodal Learning Interfaces. In Proceedings of IEEE International Conference of Acoustic, Speech and Signal Processing, volume 6, pages 3545–3548, Atlanta, USA. Wahlster, W. (1998). User and Discourse Models for Multimodal Communication. In Maybury, M. and Wahlster, W., editors, Intelligent User Interfaces, pages 359–370. Morgan Kaufmann. Wahlster, W. (2000). Mobile Speech-to-Speech Translation of Spontaneous Dialogs: An Overview of the Final Verbmobil System. In Verbmobil: Foundations of Speech-to-Speech Translation, pages 3–21. Springer Press. Wu, L., Oviatt, S., and Cohen, P. (1999). Multimodal Integration - A Statistical View. IEEE Transactions on Multimedia, 1(4):334–341. Zancanaro, M., Stock, O., and Strapparava, C. (1997). Multimodal Interaction for Information Access: Exploiting Cohesion. Computational Intelligence, 13(4):439–464. Zhou, M. X. and Pan, S. (2001). Automated Authoring of Coherent Multimedia Discourse in Conversation Systems. In Proceedings of the Ninth ACM International Conference on Multimedia, pages 555–559, Ottawa, Ontario, Canada.
Chapter 13

A GENERAL PURPOSE ARCHITECTURE FOR INTELLIGENT TUTORING SYSTEMS

Brady Clark, Oliver Lemon, Alexander Gruenstein, Elizabeth Owen Bratt, John Fry, Stanley Peters, Heather Pon-Barry, Karl Schultz, Zack Thomsen-Gray and Pucktada Treeratpituk
Center for the Study of Language and Information
Stanford, CA, USA
[email protected], [email protected],
{alexgru, ebratt, fry, peters, ponbarry, schultzk, ztgray}@csli.stanford.edu, [email protected]
Abstract
The goal of the Conversational Interfaces project at CSLI is to develop a general purpose architecture which supports multi-modal dialogues with complex devices, services, and applications. We are developing generic dialogue management software which supports collaborative activities between a human and devices. Our systems use a common software base consisting of the Open Agent Architecture, Nuance speech recogniser, Gemini (SRI’s parser and generator), Festival speech synthesis, and CSLI’s “Architecture for Conversational Intelligence” (ACI). This chapter focuses on one application of this architecture - an intelligent tutoring system for shipboard damage control. We discuss the benefits of adopting this architecture for intelligent tutoring.
Keywords:
Intelligent tutoring systems, multimodal dialogue, dialogue systems, speech technology.
1. Introduction
Multimodal, activity-oriented dialogues with devices present a challenge for dialogue system developers. Conversational interaction in these contexts is mixed-initiative and open-ended. Consider dialogue with an intelligent tutoring system. Dialogue can be unpredictable in tutorial interactions. The student may need to ask the tutor a question; e.g., to request information or request
rephrasing. In (1), from [Shah et al., 2002, page 32], a student is requesting information in the form of a yes-no question. The tutorial domain is physiology.

(1)
Student: Did you count my prediction for sv?
Tutor: Yes, but you haven’t predicted tpr.

In (2), a student is requesting that the tutor rephrase their previous utterance [Shah et al., 2002, page 34]:

(2)
Tutor: How are the falls in TPR and in CC connected to decrease in MAP?
Student: I don’t think I understand the question.
Tutor: What are the determinants of MAP?
Further, the tutor must have a way of reacting to various types of user input; e.g., by adjusting the tutorial agenda when the student asks for clarification about past topics of discussion or when the student asks the tutor to alter the initial overall tutoring plan (e.g., “Can we move on to the next topic?”). In this chapter we discuss a new general purpose architecture for intelligent dialogue systems which addresses these issues: the Architecture for Conversational Intelligence (ACI) developed at CSLI. The ACI has previously been used in a dialogue system for multimodal conversations with a robot helicopter (the WITAS system; [Lemon et al., 2002a; Lemon et al., 2002b]). We focus here on a recent deployment of this architecture in the domain of intelligent tutoring. We will first discuss the intelligent tutoring system we are developing for shipboard damage control. Next, we discuss the ACI for dialogue systems and what benefits it has for intelligent tutoring.
2. An Intelligent Tutoring System for Damage Control
Shipboard damage control refers to the task of containing the effects of fire, explosions, and other critical events that can occur aboard Naval vessels. The high-stakes, high-stress nature of this task, together with limited opportunities for real-life training, makes damage control an ideal target for AI-enabled educational technologies like intelligent tutoring systems. We are developing an intelligent tutoring system for automated critiquing of student performance on a damage control simulator [Clark et al., 2001]. The simulator is DC-Train [Bulitko and Wilkins, 1999], an immersive, multimedia training environment for damage control. DC-Train’s training scenarios simulate a mixture of physical phenomena (e.g., fire) and personnel issues (e.g., casualties). Recent research has shown that elaborate tutorial interaction during problem solving may be distracting or cause cognitive overload [Katz et al., 2000]. This
Figure 13.1.   Screen shot of GUI for reflective intelligent tutorial dialogue system.
suggests that less may be best in certain learning situations ([Sweller et al., 1998], cited in [Katz et al., 2000]): the cognitive load of simply solving a problem may be high enough that the tutorial interaction should take place after the student has solved the problem (rather than during the problem solving session); i.e., reflective tutoring. Further, work by Katz et al. [2000] (see also [Katz and Allbritton, 2002]) has shown that reflective tutoring has a positive effect on learning and enhances the acquisition of strategic and conceptual knowledge. Additionally, we take as our starting point that a tutoring system should model tactics that promote constructive and effortful responses from students [Chi et al., 2001]. For these reasons, the intelligent tutoring system we have developed for damage control is reflective (the tutor generates plans for post-practice reflection) and Socratic (the tutor asks questions rather than giving explanations). Figure 13.1 is a screenshot of the graphical user interface for our reflective intelligent tutoring system for damage control. On the right side of the screen is a global view of the ship. Ship compartments are highlighted in coordination with the speech output of the system. The bottom left corner hosts the transcript of the student’s conversation with the tutor. The top left corner is the
interactive ship display. This part of the GUI is a common workspace for the student and the tutor. The conversational participants can manipulate the ship display; e.g., both the student and tutor can highlight compartments. Thus, the tutor is truly multimodal: it coordinates linguistic input and output (speech) with non-linguistic input and output (the user can indicate a compartment with a mouse click or the system can highlight a compartment). (3) provides part of an actual session with our intelligent tutoring system. The tutor runs on a laptop using Windows 2000. (3)
Tutor: Hello, we are about to review your session from yesterday. All right, are you ready?
Student: Yes.
Tutor: In this session there were 5 primary crises and 1 secondary crisis. 5 of the 6 crises were resolved. Most of your actions were correct, however you made several mistakes. OK, let’s discuss a few of those crises. The first crisis was a fire here in Crew Living space No. 1. [centres on this Compartment and highlights it] This fire was not contained early enough. As a result, the fire spread to the adjacent Combat system equipment room No. 2. [highlights 2nd compartment] OK, let’s review how to handle this crisis. [removes highlight from 2nd compartment] What’s the first thing you should do when the fire alarm sounds?
Student: Send repair two to investigate.
Tutor: Yes, that’s right. However, you failed to do this during the session. This is a mistake you made more than once. Remember, alarms need to be verified first. OK, suppose that there is a fire. What’s the next step?
Student: Set fire and smoke boundaries.
Tutor: Yes. That’s what you did during the session. OK, what else needs to be done?
Student: The repair team should fight the fire.
Tutor: No. You missed a step. Remember, the crisis might interfere with other ship systems. Let’s try again. What else needs to be done?
Student: I don’t know.
Tutor: Well. You should order repair five to mechanically and electrically isolate the compartment. But you didn’t do this during the session. This is a mistake you made more than once. Remember, mechanical and electrical isolation protects crucial ship systems. All right, suppose isolation has begun and boundaries have been set. What’s the next step?
Student: Fight the fire.
Tutor: Yes, exactly. That’s what you did.
...
Tutor: The next crisis was a flood here in Engine room No. 2. [highlights compartment] Let’s review how to handle this crisis. What’s the first thing you should do when the flood alarm sounds?
Student: Send repair two to investigate.
Tutor: Almost. Let’s review the jurisdiction of repair lockers. The flood alarm was here in Engine room No. 2 [highlights compartment]. Please click on the location of the nearest repair locker in the left display above. Let me know when you’re ready to continue.
Student: [clicks on correct compartment] Please continue.
Tutor: Yes, that’s right. Which repair locker is that?
Student: Repair five.
Tutor: Yes, that’s right. You should order repair five to investigate the compartment. But you sent the wrong repair team during the session.
The dialogue in (3) matches the 5-step dialogue frame that Graesser and Person [1994] observed in naturalistic tutoring.

Step 1: Tutor asks question (or presents problem)
Step 2: Learner answers question (or begins to solve problem)
Step 3: Tutor gives short immediate feedback on the quality of the answer
Step 4: The tutor and learner collaboratively improve the quality of the answer
Step 5: The tutor assesses the learner’s understanding of the answer

Figure 13.2 is the overall architecture of our system (ASR = Automated Speech Recognition, TTS = Text-to-Speech). In addition to being a Socratic tutor, our tutor shares several features with other intelligent tutoring systems; e.g., CIRCSIM-Tutor [Zhou et al., 1999] and Atlas/Andes [Freedman, 2000].

- A knowledge base (problem, solution, and domain knowledge in Figure 13.2): DC-Train currently encodes all knowledge relevant to supporting reflective intelligent tutoring into a structure called a Causal Story Graph. These expert summaries encode causal relationships between events on the ship as well as the proper and improper responses to shipboard crises.
- Tutoring tactics (see Figure 13.2): To respond to student answers to tutor questions, our tutor draws on a library of tutoring tactics. These tactics are very similar to the plan operators utilized in the Atlas/Andes system. The different components of these tactics are the object (what’s being taught), the preconditions (when a particular tactic can be applied), and the recipe (the method to be used to teach the object). Preconditions on tutoring tactics involve combinations of a classification of the student’s response (e.g., a fully correct or incorrect answer) and actions in their session with the simulator DC-Train (e.g., an error of omission or correct action).
- An interpretation-generation component (ASR, TTS, Grammar, Parser-Generator, and Dialogue Manager in Figure 13.2): In our system, the student’s speech is recognized and parsed into logical forms (LFs). The architecture also allows the tutor’s speech to be generated from LF inputs, although we currently use template generation. A dialogue manager inspects the current dialogue information state to determine how
Figure 13.2.   Reflective Tutoring Architecture.
best to incorporate each new utterance into the dialogue [Lemon et al., 2002a; Lemon et al., 2002b]. An important difference is that CIRCSIM-Tutor and Atlas/Andes are entirely text-based, whereas ours is a spoken dialogue system (ASR and TTS in Figure 13.2). Our speech interface offers greater naturalness than keyboard-based input, and is also better suited to multimodal interactions (namely, one can point and click while talking but not while typing). In this respect, our tutor is similar to COVE [Roberts, 2000], a training simulator for conning Navy ships that uses speech to interact with the student. But whereas COVE uses short conversational exchanges to coach the student during the simulation, our tutor engages in extended tutorial dialogues after the simulation has ended. An additional significant difference between our system and a number of other intelligent tutoring systems is our use of ‘deep’ processing techniques. While other systems utilize ‘shallow’ statistical approaches like Latent Semantic Analysis (e.g., AutoTutor; [Wiemer-Hastings et al., 1999]), our system utilizes Gemini, a parser/generator based on a symbolic grammar. This approach enables us to provide precise and reliable meaning representations. As discussed in the introduction, conversation with intelligent tutors places the following requirements on dialogue management (see [Lemon et al., 2002b; Clark, 1996]):
- Mixed-initiative: in general, both the student and the tutor should be able to introduce topics;
- Open-ended: there are no rigid pre-determined goals for the dialogue.

In the next four sections, we discuss in more detail the implementation of our system and how the general purpose architecture for intelligent tutoring systems we have developed meets these two demands.
3. An Architecture for Multimodal Dialogue Systems
To facilitate the implementation of multimodal, mixed-initiative interactions we use the Open Agent Architecture (OAA) [Martin et al., 1999]. OAA is a framework for coordinating multiple asynchronous communicating processes. The core of OAA is a ‘facilitator’ which manages message passing between a number of encapsulated software agents that specialize in certain tasks (e.g., speech recognition). Our system uses OAA to coordinate the following agents:

- The Gemini NLP system [Dowding et al., 1993]: Gemini uses a single unification grammar both for parsing strings of words into logical forms (LFs) and for generating sentences from LF inputs (although, as mentioned above, we do not use this feature currently). This agent enables us to give precise and reliable meaning representations which allow us to identify dialogue moves (e.g., wh-query) given a linguistic input; e.g., the question “What happened?” has the (simplified) LF: whquery(wh([tense(past),action(happen)])).
- The Nuance speech recognition server: The Nuance server converts spoken utterances to strings of words. It relies on a language model, which is compiled directly from the Gemini grammar, ensuring that every recognized utterance is assigned a LF [Moore, 1998].
- The Festival text-to-speech system [Black and Taylor, 1997]: Festival is a speech synthesizer for system speech output.
- The Architecture for Conversational Intelligence (ACI) coordinates inputs from the student, interprets the student’s dialogue moves, updates the dialogue context, and delivers speech and graphical outputs to the student (i.e., generation). This agent is discussed in Section 4.

The first three agents are ‘off-the-shelf’ dialogue system components (apart from the Gemini grammar, which must be modified for each application). The ACI agent was written in Java for dialogue management applications in general and is described in more detail in Sections 4 and 5. This OAA/Gemini/Nuance/Festival/ACI architecture has also been deployed successfully in
other dialogue systems; e.g., a collaborative human-robot interface [Lemon et al., 2002a; Lemon et al., 2002b].
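As a rough illustration of how such logical forms feed dialogue management, the toy function below reads the outermost functor of an LF string and normalizes it to a dialogue move type. This is not how Gemini or the ACI is actually implemented; the move-type table and the normalization rule are assumptions made for illustration only.

```python
# Toy sketch: map a Gemini-style logical form string to a dialogue move type.
# Only the simplified LF syntax quoted in the text is assumed; the table below
# is illustrative, not the system's code.
import re

MOVE_TYPES = {"wh-query", "yn-query", "wh-answer", "yn-answer",
              "why-query", "report", "explanation"}

def dialogue_move_type(lf):
    """Read the outermost functor of the LF and normalise it to a move type."""
    m = re.match(r"\s*([A-Za-z_-]+)\s*\(", lf)
    if m is None:
        return "not-recognized"
    name = m.group(1).lower()
    if name == "whquery":            # e.g. whquery(wh([tense(past),action(happen)]))
        name = "wh-query"
    return name if name in MOVE_TYPES else "not-recognized"

print(dialogue_move_type("whquery(wh([tense(past),action(happen)]))"))  # wh-query
```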
4. Activity Models
An important part of dialogue context to be modelled is the tutor’s planned activities (what topics are going to be discussed and how), current activities (what topic is being discussed) and their execution status (pending, cancelled, etc.). Declarative descriptions of the goal decompositions of activities (COLLAGEN’s “recipes”, Atlas/Andes’ “plan operators”, our “Activity Models”) are a vital layer of representation between the dialogue manager and the tutor. Intelligent tutoring systems should be able to plan sequences of atomic actions, based on higher-level input; e.g., a problem, the student’s solution to that problem, and domain knowledge. On the basis of this input, the tutor carries out planning (e.g., an initial overall tutoring plan) and then informs the dialogue manager of the sequences of activities it wishes to perform. The model contains traditional planning constraints such as preconditions of actions (as with the tutoring tactics mentioned in Section 2 and discussed in detail below). Dialogue between the tutor and student can be used to revise the overall tutorial plan and to update the student model. We will briefly discuss both of these possible revisions. The tutor uses information in an annotated record of the student’s performance to construct an initial overall tutoring plan; i.e., what problems (e.g., shipboard crises like fires) are going to be discussed. Our tutor currently makes a list of exemplar crises that occurred in the student’s session with DC-Train. If more than one crisis of a given type occurred, the tutor picks the one with the most errors. The motivation for this particular algorithm is that the student’s knowledge and misconceptions will be reflected in the errors they make and that exemplar crises will make for the most interesting dialogues and the most opportunities for learning. This initial overall tutoring plan can be dynamically revised during the tutorial dialogue; e.g., the student can ask to skip discussion of a particular topic by saying (some variant of) “Can we please move on?” A student model represents the tutor’s estimate of what the student knows (or doesn’t) know and what skills the student has (or hasn’t) mastered. The evidence that is used in constructing a student model is the student’s solution to problems, of course, but also the student’s interaction with the tutor. For example, the student’s answers to the tutor’s questions, the student’s explanations, and the student’s questions to the tutor can all be used to update the student model over time. We are developing one representation and reasoning scheme to cover the spectrum of cases from devices with no planning capabilities to some with more impressive on-board AI, like intelligent tutoring systems. In the tutor
def strategy discuss error of omission answer incorrect
  : goal (did discuss error of omission answer incorrect)
  : preconditions
      (i) the student’s answer is incorrect
      (ii) the student’s actions in response to the damage event included an error of omission
  : recipe
      (i) provide negative feedback to the student
      (ii) give the student a hint
      (iii) ask a follow-up question
      (iv) classify the student’s response
      (v) provide feedback to the student
      (vi) tell the student the rule
      (vii) tell the student that the topic is changing

Figure 13.3.   Activity Model for Hinting.
we have developed, both the dialogue manager and the tutor have access to a “Task Tree”: a shared representation of planned activities and their execution status. The tree is built top-down by processing a problem, a student’s solution to that problem, and the relevant domain knowledge. The nodes of the tree are expanded by the dialogue manager (via the Activity Models specified for the tutor) until only leaves with atomic actions are left for the tutor to execute in sequence. The tutor and the dialogue manager share responsibility for constructing different aspects of the Task Tree; e.g., the dialogue manager marks transitions between topics (with “OK”, “All right”) while the tutor constructs the overall initial tutoring plan. Tutoring tactics (e.g., hinting) are one type of Activity Model in our tutor. To initiate a tutoring tactic, the student invokes the tutor by responding to a question. The tutor searches the library of tutoring tactics to find all of the tactics whose preconditions are satisfied in the current context. Like the plan operators in other systems (e.g., Atlas/Andes; [Freedman, 2000]), each tutorial tactic has a multi-step recipe [Wilkins, 1988] composed of a sequence of actions. Actions in a recipe can be primitive actions like providing feedback or complex actions like an embedded tutoring tactic. An example tutoring tactic is given in Figure 13.3. For legibility, the key elements are presented in English rather than Java. The object (or goal) of the tutoring tactic is to teach the student about an action they failed to perform (an error of omission). This tactic is used when the student answers the tutor’s question incorrectly and the student’s action in response to a damage event included an error of omission. These are the preconditions on the application of this tutoring tactic. For example, in the dialogue in (3) the student said “The
repair team should fight the fire” (an incorrect answer, and the student forgot to isolate the compartment in their session with DC-Train). The method that the tutor uses to teach the student about their error of omission is given by the recipe – a series of sequential system actions.

The tutor utilizes information in a Causal Story Graph (CSG, an annotated description of the student’s performance in their session with DC-Train) to decide which tutorial tactic is appropriate with respect to a student’s response to a particular question. For example, in Figure 13.3, the tutor uses, in the preconditions on the application of the tutoring tactic, the information in the CSG which classifies the relevant action as an error of omission, in addition to the classification of the student’s response as an incorrect answer. The preconditions on other tutoring tactics will involve different combinations of action and response classification. Hence, it is the combination of the classification of a student’s response (as correct, incorrect, etc.) and action in a DC-Train session (as an error of omission, error of commission, etc.) which determines which tutoring tactic the tutor uses to teach the student.

The tutor then adds a node to the Task Tree describing the tutoring tactic. The tutoring tactic specifies what atomic actions should be invoked (e.g., feedback), and under what conditions they should be invoked. For example, in Figure 13.3, the tutoring tactic states that the tutor should provide negative feedback to the student; give the student a hint; ask a follow-up question; classify the student’s response; provide feedback to the student; tell the student the rule; and tell the student that the topic is changing.

Nodes on the Task Tree can be active, completed, or cancelled. Any change in the state of a node (e.g., because of a question from the student) is added to the System Agenda (the stack of issues to be raised by the system – see below). For example, in (4), the student asks the tutor a “Why”-question following the tutor’s explanation. The tutor responds to the student’s question by adding the student’s question to the Task Tree and marking it as active. The tutor then asks the student if they would like the tutor to display the relevant section of doctrine. After the student has responded, the node corresponding to the student’s question is marked completed.

(4)
Tutor: You should request permission from the EOOW to start a firepump. But you didn’t do this during the session. Remember, you need the EOOW’s permission before starting a firepump.
Student: Why?
Tutor: Well. This is specified in damage control doctrine. Would you like me to display the relevant section of doctrine?
Student: Yes.
Tutor: OK. [doctrine is displayed] Let me know when you’re ready to continue.
5. Dialogue Management Architecture
Dialogue Management with the ACI makes use of several recent ideas in dialogue modelling, described in detail in [Lemon et al., 2002a; Lemon et al., 2002b]. Much of what follows in this section is an adaptation of the discussion in [Lemon et al., 2002b]. The Dialogue Manager creates and updates an Information State, corresponding to a notion of dialogue context. Dialogue moves (e.g., wh-query, wh-answer) update information states. A student’s dialogue move might send a response to the tutor, elicit an assertion by the tutor, or prompt a follow-up question. The tutor itself generates dialogue moves that are treated just like the student’s conversational contributions. The ACI includes the following dynamically updated components, see [Lemon et al., 2002a; Lemon et al., 2002b] for full details:

- The Dialogue Move Tree: a structured history of dialogue moves and ‘threads’, plus a list of ‘active nodes’;
- The Task Tree: a temporal and hierarchical structure of activities initiated by the system or the user, plus their execution status;
- The System Agenda: the issues to be raised by the system;
- The Salience List: the objects referenced in the dialogue thus far, ordered by recency;
- The Pending List: the system’s questions asked but not yet answered by the student;
- The Modality Buffer: stores gestures for later resolution.

Figure 13.4 shows an (edited) example of an Information State logged by our system, displaying the interpretation of “I should send repair two to fight the fire”. Dialogue management involves a set of domain-independent dialogue move types (e.g., wh-query, wh-answer, etc.; [Ginzburg et al., 2001]). A dialogue
Figure 13.4. Information State (Dialogue Move Tree).
with the system generates a particular Dialogue Move Tree (DMT). The DMT provides a representation of the current state of the conversation in terms of a structured history of dialogue moves. Each node is an instance of a dialogue move type and is linked to a node on the Task Tree, where appropriate. Further, the DMT determines whether or not user input can be interpreted in the current dialogue context, and how to interpret it. Incoming logical forms (LFs) are tagged with a dialogue move type. For example, the LF wh-query(wh([tense(past),action(happen)])) corresponds to the utterance “What happened”, which has the dialogue move type wh-query. How are dialogue moves related to the current context? We use the DMT to answer this question:

- A DMT is a history or “message board” of dialogue contributions, organized by “thread”, based on activities. In our tutor, threads correspond to topics like individual ship crises.
- A DMT classifies which incoming utterances can be interpreted in the current dialogue context, and which cannot be. It thus delimits a space of possible Information State update functions.
- A DMT has an Active Node List (ANL) which controls the order in which this function space is searched.
- A DMT classifies how incoming utterances are to be interpreted in the current dialogue context.
A particular Dialogue Move Tree can be understood as a function space of dialogue Information State update functions of the form

f : Node × Conversational Move → Information State Update

where Node is an active node on the dialogue move tree, a Conversational Move is a structure (Input Logical Form, Activity Tag, Agent), and an Information State Update is a function g : IS → IS which changes the current IS. The details of the update function are determined by the node type (e.g., wh-query), the incoming dialogue move type (e.g., wh-answer) and its content, as well as the value of the Activity Tag. This technique of modelling dialogue context is a variant of “conversational games” (or “dialogue games”; [Carlson, 1983]) and, in the context of task-oriented dialogues like tutoring, “discourse segments” [Grosz and Sidner, 1986]. Both of these accounts of dialogue context rely on the observation that answers generally follow questions and commands are generally acknowledged, so that dialogues can be partially described in terms of “adjacency pairs” of such dialogue moves. The ACI’s notion of Attachment embodies this idea. The two main steps of the algorithm controlling dialogue management are Attachment and Process Node:

Attachment: process the incoming Conversational Move c with respect to the current DMT and Active Node List, and attach a new node N interpreting c to the tree if possible (i.e., find the most active node on the DMT of which the new node can be a daughter, and add the new node at that location).

Process node: process the new node N, if it exists, with respect to the current information state. Perform an Information State update using the dialogue move type and content of N.

The effect of an update function depends on the input conversational move c (in particular, the dialogue move type and the contents of the logical form) and the node of the DMT that it attaches to. The possible attachments can be thought of as adjacency pairs (see [Levinson, 1983]): paired speech acts which organize the dialogue locally. Each dialogue move class contains information about which node types it can attach. Some examples of different attachments available in the current version of our tutor can be seen in Figure 13.5 (where activity tags are not specified, attachment does not depend on the sharing of an activity tag, as with the node type not-recognized). For example, the fourth entry in the table states that a wh-query generated by the tutor, with the activity tag t, is able to attach any wh-answer by the student with that same activity tag. Similarly, the row for explanation states that any explanation by the tutor, with the activity tag t, can attach a why-query by the student.
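The two steps can be sketched in code as follows. The dict-based node and move encodings, and the tiny attachment test used in the demo, are assumptions made for illustration; the attachment relation actually used by the tutor is the one summarized in Figure 13.5 below.

```python
# A simplified sketch of Attachment and Process Node.  The node/move encodings
# and the demo's attachment test are illustrative assumptions only.

def attach(active_node_list, move, can_attach):
    """Attach a node interpreting the incoming conversational move, if possible.

    Walks the Active Node List in order and makes the move a daughter of the
    first (most active) node that accepts it; returns the new node or None.
    """
    for node in active_node_list:
        if can_attach(node, move):
            new_node = {"move": move, "parent": node, "daughters": []}
            node["daughters"].append(new_node)
            return new_node
    return None

def process_node(info_state, new_node, update):
    """Perform the Information State update determined by the attached node."""
    if new_node is None:
        return info_state        # e.g. trigger a "not recognized" response instead
    return update(info_state, new_node)

# Minimal demo: a tutor wh-query attaches a student wh-answer with the same
# activity tag (cf. the fourth row of Figure 13.5).
def same_pair(node, move):
    return (node["move"]["type"] == "wh-query" and move["type"] == "wh-answer"
            and node["move"]["activity"] == move["activity"])

root = {"move": {"type": "wh-query", "activity": "t", "speaker": "Tutor"},
        "parent": None, "daughters": []}
answer = {"type": "wh-answer", "activity": "t", "speaker": "Student"}
print(attach([root], answer, same_pair) is not None)    # True
```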
Node Type          Activity Tag   Speaker   Attaches
noun-query         t              Tutor     wh-answer(t, user)
why-query          t              Student   report(t, system)
yn-query           t              Tutor     yn-answer(t, user)
wh-query           t              Tutor     wh-answer(t, user)
yn-answer          t              Student   report(t, system)
wh-answer          t              Student   report(t, system)
explanation        t              Tutor     why(t, user)
not-recognized                    Student   pardon(system)

Figure 13.5.   Attachment in the Dialogue Move Classes.
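One way to make such a table operational is to store it as data, so that adding a new dialogue move type only means adding a row. The encoding below is an illustrative assumption (the tutor keeps this information in its Java dialogue move classes), but it reproduces the rows of Figure 13.5.

```python
# The attachment relation of Figure 13.5 encoded as data; the encoding itself
# is an assumption made for illustration.

ATTACHES = {
    # (node type, speaker of node) -> moves it can attach (move type, agent)
    ("noun-query",     "Tutor"):   {("wh-answer", "user")},
    ("why-query",      "Student"): {("report", "system")},
    ("yn-query",       "Tutor"):   {("yn-answer", "user")},
    ("wh-query",       "Tutor"):   {("wh-answer", "user")},
    ("yn-answer",      "Student"): {("report", "system")},
    ("wh-answer",      "Student"): {("report", "system")},
    ("explanation",    "Tutor"):   {("why", "user")},
    ("not-recognized", "Student"): {("pardon", "system")},
}

def can_attach(node_type, node_speaker, move_type, move_agent,
               node_tag=None, move_tag=None):
    """True if a node of the given type/speaker can attach the incoming move."""
    if (move_type, move_agent) not in ATTACHES.get((node_type, node_speaker), set()):
        return False
    # not-recognized attachment does not depend on sharing an activity tag
    return node_type == "not-recognized" or node_tag == move_tag

print(can_attach("wh-query", "Tutor", "wh-answer", "user", "t", "t"))   # True
print(can_attach("explanation", "Tutor", "why", "user", "t", "s"))      # False
```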
The possible attachments summarized in Figure 13.5 constrain the ways in which DMTs can grow, and thus classify the dialogue structures that can be captured in the current version of our tutor. As new dialogue move types are added to the tutor, this table will be extended to cover a greater range of dialogue structures. Note that the tutor’s dialogue moves appear on the DMT, just as the student’s do. Recall the requirements placed on automated tutors discussed in Section 2. The DMT structure is able to interpret both the student and tutor input as dialogue moves at any time, thus allowing for mixed-initiative. Further, the DMT can handle dialogues with no clear endpoint (open-ended). In the next section, we discuss further benefits of the ACI for intelligent tutoring systems, both in the domain of shipboard damage control and in general.
6. Benefits of ACI for Intelligent Tutoring Systems
There are several benefits to CSLI’s dialogue management architecture [Lemon et al., 2002b]:

- The dialogue management architecture is reusable across domains. As mentioned, the same architecture has been successfully implemented in an unmanned helicopter interface [Lemon et al., 2002a; Lemon et al., 2002b]. The Activity Models - e.g., the properties of the relevant activities - and (some aspects of) the grammar will have to be changed across domains.
- The Dialogue Move Tree/Task Tree distinction allows one to capture the notion that dialogue works in service of the activity the participants are engaged in. That is, the structure of the dialogue, as reflected in the Dialogue Move Tree, is a by-product of other aspects of the dialogue management architecture; e.g., the Activity Models. The Dialogue Move Tree/Task Tree distinction is supported by recent theories of dialogue; e.g., Clark’s [1996] joint activity theory of dialogue.
- The dialogue move types are domain-general, and thus reusable in other domains.
- The architecture supports multimodality with the Modality Buffer. For example, we are able to coordinate linguistic input and output (e.g., speech) with non-linguistic input and output (e.g., the student can indicate a region of the ship display with a mouse click or the tutor can highlight a compartment).
7. Conclusion
We began by identifying two properties of tutorial interaction that a dialogue system must capture: mixed-initiative and open-endedness. We then explained the domain-general modelling techniques we used to build an intelligent tutoring system for damage control. This tutor is novel in that it is, as far as we know, the first spoken intelligent tutoring system. Our speech interface offers greater naturalness than text-based intelligent tutors and is better suited to multimodal interactions. We discussed CSLI’s dialogue management architecture and the algorithms we used to develop a tutor that is multimodal, mixed-initiative, and allows for open-ended dialogues.
7.1 Future Work
We are currently expanding the tutoring module to support a wider range of tutoring tactics and strategies. Some of these tactics and strategies are specific to damage control. Other tactics and strategies are domain-general. We plan to evaluate how well CSLI’s dialogue management architecture (the Dialogue Move Tree and the Activity Models we have developed for our tutor) handles tutorial dialogues in other domains; e.g., basic electricity and electronics or algebra.
7.2 Evaluation
As part of the research described here, we have done only informal evaluation of the system so far.1 Three former damage control assistants, with no previous experience using our tutor, completed a tutorial session. Each subject was able to complete a dialogue with our system, and data has been collected, including speech recognition error rates. All three dialogues were recorded, and the Information States logged. The WITAS "sister" system [Lemon et al., 2002a; Lemon et al., 2002b] described in the introduction has undertaken user tests and evaluation, and also adopted a "targeted help" approach to dealing with speech recognition errors [Hockey et al., 2003]. It would be easy to apply similar techniques for the tutoring system described in this chapter, given that the WITAS system and the tutoring system utilize the same underlying general purpose architecture. We are planning two large evaluation efforts in the near future, applying the techniques mentioned above: one at Stanford University, the other at the Naval Postgraduate School in Monterey, CA. We are also planning several psycholinguistic experiments utilizing our tutor. Broadly, we would like to find out if students learn better with a spoken environment. If so, is that directly because of the naturalness of the modality, or because speech encourages longer utterances and more student initiative?

1 Since the completion of this chapter, we have performed a formal evaluation of our system. Pon-Barry et al. [2004] describe this evaluation and present preliminary results showing the effectiveness of our spoken conversational tutor as a learning tool.
Acknowledgements This work is supported by the Department of the Navy under research grant N000140010660, a multidisciplinary university research initiative on natural language interaction with intelligent tutoring systems. The Architecture for Conversational Intelligence was funded under the Wallenberg laboratory for research on Information Technology and Autonomous Systems (WITAS) Project, Linköping University, by the Wallenberg Foundation, Sweden.
References Black, A. and Taylor, P. (1997). Festival Speech Synthesis Systems: System Documentation (1.1.1). Technical Report HCRC/RT-83, Human Communication Research Centre, University of Edinburgh. Bulitko, V. V. and Wilkins, D. C. (1999). Automated Instructor Assistant for Ship Damage Control. In Proceedings of the Eleventh Conference on Innovative Applications of Artificial Intelligence (IAAI), pages 778–785, Orlando, Florida. Carlson, L. (1983). Dialogue Games: An Approach to Discourse Analysis. Dordrecht: Reidel. Chi, M. T. H., Siler, S., Jeong, H., Yamauchi, T., and Hausmann, R. G. (2001). Learning from Human Tutoring. Cognitive Science, 25:471–533. Clark, B., Fry, J., Ginzton, M., Peters, S., Pon-Barry, H., and Thomsen-Gray, Z. (2001). A Multimodal Intelligent Tutoring System for Shipboard Damage Control. In Proceedings of International Workshop on Information Presentation and Multimodal Dialogue (IPNMD), pages 121–125, Verona, Italy.
Clark, H. H. (1996). Using Language. Cambridge: Cambridge University Press. Dowding, J., Gawron, J., Appelt, D., Bear, J., Cherny, L., Moore, R. C., and Moran, D. (1993). Gemini: A Natural Language System for Spoken-Language Understanding. In Proceedings of the ARPA Workshop on Human Language Technology. Freedman, R. (2000). Plan-Based Dialogue Management in a Physics Tutor. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP), pages 52–59, Seattle, Washington, USA. Ginzburg, J., Sag, I. A., and Purver, M. (2001). Integrating Conversational Move Types in the Grammar of Conversation. In Proceedings of the Fifth Workshop on Formal Semantics and Pragmatics of Dialogue (BI-DIALOG), pages 45–56, Bielefeld, Germany. Graesser, A. and Person, N. K (1994). Question Asking during Tutoring. American Educational Research Journal, 31:104–137. Grosz, B. and Sidner, C. (1986). Attentions, Intentions, and the Structure of Discourse. Computational Linguistics, 12(3):175–204. Hockey, B., Lemon, O., Campana, E., Hiatt, L., Aist, G., Hieronymus, J., Gruenstein, A., and Dowding, J. (2003). Targeted Help for Spoken Dialogue Systems: Intelligent Feedback Improves Naïve Users’ Performance. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 147–154, Budapest, Hungary. Katz, S. and Allbritton, D. (2002). Going Beyond the Problem Given: How Human Tutors Use Post-practice Discussions to Support Transfer. In Cerri, S. A., Gouardères, G., and Paraguaçu, F., editors, Proceedings of the Sixth International Conference on Intelligent Tutoring Systems (ITS), pages 641– 650. Berlin: Springer. Katz, S., O’Donnell, G., and Kay, H. (2000). An Approach to Analyzing the Role and Structure of Reflective Dialogue. International Journal of Artificial Intelligence in Education, 11:320–343. Lemon, O., Gruenstein, A., Battle, A., and Peters, S. (2002a). Multi-Tasking and Collaborative Activities in Dialogue Systems. In Proceedings of the Third SIGdial Workshop on Discourse and Dialogue, pages 131–124, Philadelphia, USA. Lemon, O., Gruenstein, A., and Peters, S. (2002b). Collaborative Activities and Multi-Tasking in Dialogue Systems. Traitement Automatique des Langues (TAL, special issue on dialogue), 43(2):131–154. Levinson, S. (1983). Pragmatics. Cambridge: Cambridge University Press. Martin, D., Cheyer, A., and Moran, D. (1999). The Open Agent Architecture: A Framework for Building Distributed Software Systems. Applied Artificial Intelligence, 13:1–2.
Moore, R. (1998). Using Natural Language Knowledge Sources in Speech Recognition. In Proceedings of the NATO Advanced Study Institute (ASI). Pon-Barry, H., Clark, B., Bratt, E. O., Schultz, K., and Peters, S. (2004). Evaluating the Effectiveness of SCoT: A Spoken Conversational Tutor. In Mostow, J. and Tedesco, P., editors, Proceedings of the ITS Workshop on Dialogue-based Intelligent Tutoring Systems: State of the Art and New Research Directions, pages 23–32, Maceio, Brazil. Roberts, B. (2000). Coaching Driving Skills in a Shiphandling Trainer. In Proceedings of the AAAI Fall Symposium on Building Dialogue Systems for Tutorial Applications, page 150, North Falmouth, Massachusetts. Shah, F., Evens, M., Michael, J., and Rovick, A. (2002). Classifying Student Initiatives and Tutor Responses in Human Keyboard-to-Keyboard Tutoring Sessions. Discourse Processes, 32:23–52. Sweller, J., van Merriënboer, J., and Paas, F. (1998). Cognitive Architecture and Instructional Design. Educational Psychology Review, 10(3):251–296. Wiemer-Hastings, P., Wiemer-Hastings, K., and Graesser, A. (1999). Improving an Intelligent Tutor’s Comprehension of Students With Latent Semantic Analysis. In Lajoie, S. P. and Vivet, M., editors, Proceedings of Artificial Intelligence in Education (AIED), pages 535–542. Amsterdam: IOS Press. Wilkins, D. (1988). Practical Planning: Extending the Classical AI Planning Paradigm. San Mateo, CA: Morgan Kaufmann. Zhou, Y., Freedman, R., Glass, M., Michael, J., Rovick, A., and Evens, M. (1999). Delivering Hints in a Dialogue-Based Intelligent Tutoring System. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI), pages 128–134, Orlando, Florida. AAAI Press and MIT Press. Published in one volume with Eleventh Conference on Innovative Applications of Artificial Intelligence (IAAI).
Chapter 14 MIAMM – A MULTIMODAL DIALOGUE SYSTEM USING HAPTICS Norbert Reithinger, Dirk Fedeler DFKI – German Research Center for Artificial Intelligence Saarbrücken, Germany
{Norbert.Reithinger, Dirk.Fedeler}@dfki.de
Ashwani Kumar LORIA, Nancy, France [email protected]
Christoph Lauer, Elsa Pecourt DFKI – German Research Center for Artificial Intelligence Saarbrücken, Germany
{Christoph.Lauer, Elsa.Pecourt}@dfki.de
Laurent Romary LORIA, Nancy, France [email protected]
Abstract
In this chapter we describe the MIAMM project. Its objective is the development of new concepts and techniques for user interfaces employing graphics, haptics and speech to allow fast and easy navigation in large amounts of data. This goal poses challenges as to how the information and its structure can be characterized by means of visual and haptic features, how the architecture of such a system is to be defined, and how the interfaces between the modules of a multimodal system can be standardized.
Keywords:
Multimodal dialogue system, haptics, information visualization.
1. Introduction
Searching an information service for a certain piece of information is still exhausting and tiring. Imagine a user who has a portable 30 GB MP3 player. All titles are attributed with the most recent metadata information. Now he wants to search through the thousands of titles he has stored in the handheld device, holding it with one hand and possibly operating the interface with the other one, using some small keyboard, handwriting recognition or similar means. He can enter the search mode and select one of the main music categories, the time interval, or a large number of genres and format types. Scrolling through menu after menu is neither natural nor user-adapted. Even recent interface approaches like the iPod navigation solve these problems only partially, or sidestep them entirely, like the iPod shuffle, which defines randomness and lack of user control as a feature. Basically, we have two problems here: a user request that must be narrowed down to the item the user really wants, and the limited interface possibilities of such a small device. If we apply a (speech) dialogue interface to this problem, the dialogue needed to extract exactly the title the user wants might be very lengthy. On the other hand, a menu-based interface is too time consuming and cumbersome due to the multitude of choices, and not very usable on a mobile device due to its dependence on graphical input and output. The main objective of the MIAMM project (http://www.miamm.org/)1 is to develop new concepts and techniques in the field of multimodal interaction to allow fast and natural access to such multimedia databases (see [Maybury and Wahlster, 1998] for a general overview of multimodal interfaces). This implies both the integration of available technologies in the domain of speech interaction (German, French, and English) and multimedia information access, and the design of novel technology for haptic designation and manipulation coupled with an adequate visualisation. In this chapter we first motivate the use of haptics and present the architecture of the MIAMM system. Since the visualization of the data guides the haptic interaction, we then introduce the main visualization concepts, which build on conceptual spaces, and present the current visualization possibilities. Next we introduce the dialogue management approach in MIAMM, which divides into multimodal fusion and action planning. Finally, we give a short introduction to MMIL, the data exchange and representation language used between the various modules of the system.
1 The project MIAMM was partially funded by the European Union (IST-2000-29487) from 2001 to 2003. The partners are LORIA (F, coordinating), DFKI (D), Sony Europe (D), Canon (UK), and TNO (NL). The responsibility for this contribution lies with the authors.
2. Haptic Interaction in a Multimodal Dialogue System
2.1 Haptics as a New Modality in Human-Computer Interaction
One of the basic human senses is the haptic-tactile sense. In German, to understand can be expressed as begreifen – to grip – indicating that you really command a topic only after thoroughly touching it. What things feel like, how the parts of a mechanism interact, or which feedback an instrument provides, are important cues for the interaction of humans with their natural and technological environment. Not surprisingly, the tactile and sensory-motoric features of a new product are traditionally included in the design decisions of manufacturers, e.g. of carmakers. Also in areas like remote control, haptics is commonly considered an important interaction control and feedback channel2. It is therefore surprising that this modality has only recently gained attention in the human-computer interaction community. One can speculate whether the disembodied world of zeroes and ones in the computer distances us too much from the real world. However, with the advent of advanced graphical virtual worlds, getting embodied feedback is becoming more and more important. A forerunner of this trend, as in many other areas, is video gaming, where the interaction with the virtual world calls for physical feedback. Over the last years, force-feedback wheels and joysticks have provided players with feedback on their interaction with the game world. While these interactions manipulate virtual images of real scenes, our goal in MIAMM is to interact with complex and possibly unstructured information spaces using multiple modalities, namely speech and haptics. Speech dialogue systems are nowadays good enough to be fielded for simple tasks. The German railway, for example, split its train timetable service in 2002 into a free-of-charge speech dialogue system and a premium-cost, human-operated service. Haptic interaction in dialogue systems is rather new, however. Basically we are facing the following challenges:
- How do we visualize information and its structure?
- Which tactile features can we assign to information?
- How can we include haptics in the information flow of a dialogue system?
We will address these questions in the sections below. To give an impression of the envisioned end-user device, the (virtual) handheld MIAMM appliance is shown in Figure 14.1.
2 See e.g. http://haptic.mech.nwu.edu/ for references.
The user interacts with the device using speech and/or the haptic buttons to search, select, and play tunes from an underlying database. The buttons can be assigned various feedback functions; for example, they can convey the rhythm of the tune currently in focus through tactile feedback. On the top-right side is a jog dial that can also be pressed. All buttons provide force feedback, depending on the assigned function and visualization metaphor.
Figure 14.1. The simulated PDA device.
2.2 The Architecture of the MIAMM System
The timing of haptic interaction is a further challenge, and not only a technical one. Consider the physiology of the sensory-motoric system: the receptors for pressure and vibration in the hand have a stimulus threshold of 1µm and an update frequency of 100 to 300 Hz [Beyer and Weiss, 2001]. The feedback must therefore not be delayed by time-consuming reasoning processes if the interaction is to be realistic: if the haptic feedback of the system is delayed beyond the physiologically acceptable limits, the interaction will feel unnatural.
Processing and reasoning time therefore plays an important role in haptic interaction and has to be addressed in all processing stages of MIAMM. In 2001, one working group (WG 3) at the Schloss Dagstuhl workshop “Coordination and Fusion in Multimodal Interaction”3 discussed architectures for multimodal systems. The final architecture proposal largely follows the “standard” architecture of interactive systems, with the consecutive steps mode analysis, mode coordination, interaction management, presentation planning, and mode design. For MIAMM we discussed this reference architecture and checked its feasibility for a multimodal interaction system using haptics. We came to the conclusion that a more or less pipelined architecture does not suit the haptic modality. For modalities like speech, no immediate feedback is necessary: deep reasoning can be used, and a reaction within about one second is acceptable. As a consequence, our architecture (see Figure 14.2) treats the modality-specific processes as modules which may have an internal life of their own: only important events must be sent to the other modules, and modules can ask about the internal state of other modules.
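To make this decoupling concrete, the following sketch (ours, not code from the MIAMM implementation; module, event and method names are invented for illustration) shows a haptic module that runs its own high-rate internal loop, reacts to button forces locally, publishes only dialogue-relevant events to the rest of the system, and exposes its internal state on request.

import queue
import threading
import time

class EventBus:
    """Minimal publish/subscribe stand-in for the inter-module messaging."""
    def __init__(self):
        self.events = queue.Queue()

    def publish(self, source, event_type, payload):
        self.events.put((source, event_type, payload))

class HapticModule:
    """A modality module with an 'internal life': it renders force feedback
    locally at a high rate and only reports dialogue-relevant events."""
    def __init__(self, bus, rate_hz=300):
        self.bus = bus
        self.period = 1.0 / rate_hz
        self.focused_item = None      # internal state other modules may query
        self._running = False

    def query_state(self):
        # Other modules ask about the internal state instead of receiving
        # every low-level button sample.
        return {"focused_item": self.focused_item}

    def _loop(self):
        while self._running:
            sample = self._read_buttons()        # low-level pressure sample (stub)
            self._render_force_feedback(sample)  # immediate local reaction
            if sample.get("selected"):           # only important events go out
                self.bus.publish("haptics", "selection", self.focused_item)
            time.sleep(self.period)

    def _read_buttons(self):
        return {"pressure": 0.0, "selected": False}   # hardware stub

    def _render_force_feedback(self, sample):
        pass                                          # device output stub

    def start(self):
        self._running = True
        threading.Thread(target=self._loop, daemon=True).start()

The point of the sketch is the division of labour: the 300 Hz loop never waits on dialogue reasoning, while the dialogue manager only ever sees selections and can pull the current focus via query_state when it needs it.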
Figure 14.2. The MIAMM Architecture.
The system consists of two modules for natural language input processing, namely recognition and interpretation. On the output side we have an MP3 player to play the tunes, and pre-recorded speech prompts to provide acoustic feedback.
3 See http://www.dfki.de/∼wahlster/Dagstuhl Multi Modality/ for the presentations.
The visual-haptic-tactile module (VisHapTac) is responsible for the selection of the visualization and for the assignment of haptic features to the force-feedback buttons. The visualization module renders the graphic output and interprets the force the user applies to the haptic buttons. The results are communicated back to the visual-haptic-tactile module. The dialogue manager consists of two main blocks, namely the multimodal fusion, which is responsible for the resolution of multimodal references, and the action planner. A simple dialogue history provides contextual information. The action planner is connected via a domain model to the multimedia database. The domain-model inference engine facilitates all accesses to the database. In the case of the language modules, where reaction time is important but not vital for a satisfactory interaction experience, every result, e.g. an analysis from the speech interpretation, is forwarded directly to the consuming agent. The visual-haptic and the visualization modules, with their real-time requirements, are different. The dialogue manager passes the information to be presented to the agent, which determines the visualization. It also assigns the haptic features to the buttons. The user can then use the buttons to operate on the presented objects. As long as no dialogue intention is assigned to a haptic gesture, all processing takes place in the visualization module, with no data being passed back to the dialogue manager. Only if one of these actions is, e.g., a selection does the visualization module autonomously pass the information back to the dialogue manager via the visual-haptic-tactile module. If the multimodal fusion needs information about objects currently in the visual focus, it can ask the visual-haptic agent. The whole system is based partly on modules already available at the partner institutions, e.g. speech recognizers, speech interpretation and action planning, and partly on modules that were developed within the project. The haptic-tactile interaction uses multiple PHANToM devices (http://www.sensable.com/), simulating the haptic buttons. The graphic-haptic interface is based on the GHOST software development kit provided by the manufacturer of the PHANToMs. The 3-D models for the visualizations are imported via an OpenGL interface from a modelling environment. The inter-module communication is based on the “Simple Object Access Protocol” (SOAP), a W3C recommendation for a lightweight protocol to exchange information in a decentralized, distributed environment. However, since the protocol adds a significant performance penalty to the system, we developed a solution that uses the message structure of SOAP, but delivers messages directly if all modules reside in the same execution environment.
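A minimal sketch of this messaging compromise is given below. It is not the MIAMM code; the class and method names are invented, and the envelope template is simply the generic SOAP 1.1 skeleton. The point is that the same message structure can either be handed over in-process or posted to a remote SOAP endpoint.

from urllib import request

SOAP_TEMPLATE = (
    '<?xml version="1.0"?>'
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Body>{body}</soap:Body></soap:Envelope>"
)

class Dispatcher:
    """Keeps the SOAP message structure, but short-circuits delivery when the
    target module lives in the same execution environment."""
    def __init__(self):
        self.local_modules = {}     # name -> object exposing handle_message(envelope)
        self.remote_endpoints = {}  # name -> URL of a real SOAP endpoint

    def register_local(self, name, module):
        self.local_modules[name] = module

    def register_remote(self, name, url):
        self.remote_endpoints[name] = url

    def send(self, target, body_xml):
        envelope = SOAP_TEMPLATE.format(body=body_xml)
        if target in self.local_modules:
            # Same execution environment: hand the message over directly,
            # avoiding the HTTP round trip and a second XML parse.
            return self.local_modules[target].handle_message(envelope)
        # Otherwise perform a genuine SOAP call over HTTP.
        req = request.Request(
            self.remote_endpoints[target],
            data=envelope.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
        )
        with request.urlopen(req) as response:
            return response.read().decode("utf-8")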
3. Visual Haptic Interaction – Concepts in MIAMM
The aim of the Visual Haptic Interaction (VisHapTac) module in the MIAMM system is to compute the visualization for the presentation requested by the dialogue management. It therefore has to find an adequate way to display a given set of data and to provide the user with intuitive manipulation features. This also includes the interpretation of the haptic user input. To do this, VisHapTac has to analyse the given data with respect to predefined characteristics and to map them to the requirements of visualization metaphors. In the next paragraphs we show briefly what visualization metaphors are, which metaphors we use in the MIAMM project, and which requirements they have to fulfil. We also discuss which data characteristics are suitable for the system, where they come from, and how they influence the selection of a visualization metaphor. A general overview of visualization techniques can be found, e.g., in [Card et al., 1999].
Figure 14.3. The wheel visualization.
3.1 Visualization Metaphors
A visualization metaphor (based on the notion of conceptual spaces, see [Gärdenfors, 2000]) is a concept for presenting information in terms of a real-world object. Manipulating the presented data should remind the user of the handling of the corresponding object. An example is a conveyor belt on which things are put in a sequence. This metaphor can be used for presenting a list of items. Scrolling up or down in the list is then represented by turning the belt to one or the other side. For the MIAMM project we use the following visualization metaphors (as presented in [Fedeler and Lauer, 2002]):
Figure 14.4. The timeline visualization.
3.1.1 The visualization metaphor “conveyor belt/wheel”. The wheel visualization displays a list that can endlessly be scrolled up and down with the haptic buttons. The user can feel the clatter of the wheel on the buttons. The “conveyor belt/wheel” metaphor is used as described above with a focus area in the middle of the displayed part of it. It is suitable for a one-dimensional, not
necessarily ordered set of items. It is thus one of the least restricted visualization metaphors, which makes the wheel a good default visualization for any kind of incoming data when there is no good criterion for ordering or clustering the information. For a small set of items (fewer than 30) this metaphor also gives a good overview of the data.
3.1.2 The visualization metaphor “timeline”. The timeline visualization is used for data in which one dimension is ordered and has sub-scales. One example is date information with years and months. The user stretches and compresses the visible time scope like a rubber band using the haptic buttons, feeling the resistance of the virtual material. Usually, a data entry in the middle of the visualized part of the timeline is highlighted to show the focussed item. The user can select this highlighted item for the play list, or can directly play it, e.g., by uttering “Play this one”.
Figure 14.5. The lexicon visualization.
3.1.3 The visualization metaphor “lexicon”. The lexicon visualization displays a sorted set of clustered items, similar to the “rolodex” file card tool. One scalar attribute of the items is used to cluster the information. For example, the tunes can be ordered alphabetically using the singer’s name. Each item is shown on a separate card, and separator cards labelled with the first letter divide items with different first letters. Since only one card is shown at a time, detailed descriptions of the item can be presented in this visualization. Navigation in this visualization is similar to the wheel: the user browses through the cards by rotating the rolodex with the buttons and the dial. A stronger pressure increases the speed of the rotation.
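The mapping from button pressure to browsing speed can be as simple as the following sketch; the threshold, gain and ceiling values are illustrative and not taken from the MIAMM prototype.

def rotation_speed(pressure, threshold=0.15, gain=4.0, max_speed=10.0):
    """Map normalized button pressure (0..1) to a card-rotation speed
    (cards per second). Below the threshold nothing moves; above it the
    speed grows with the applied force, up to a ceiling."""
    if pressure <= threshold:
        return 0.0
    return min(max_speed, gain * (pressure - threshold) / (1.0 - threshold))

# rotation_speed(0.2) -> slow browsing; rotation_speed(0.9) -> fast skimming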
Figure 14.6. The map visualization.
3.1.4 The visualization metaphor “map/terrain”. The map or terrain visualization metaphor clusters information according to its main characteristics – in the example figure according to genres and subgenres – and groups it into neighbourhoods. A genetic algorithm with an underlying physical model generates the map. It ensures that different characteristics lie in distant areas of the map, while similar structures are close neighbours.
The user navigates through the map with the buttons, “flying” through the visualization. He can zoom into the map and finally select titles. This visualization is especially useful for presenting two-dimensional information with inherent similarities. Distances and connections between the separate clusters can be interpreted as relations between the data.
3.2 Data Characteristics
The basic step when choosing a visualization metaphor is to characterise the underlying data. Some important characteristics are:
- numeric, symbolic (or mixed) values;
- scalar, vector or complex structure;
- unit variance;
- ordered or non-ordered data sets;
- discrete or continuous data dimensions;
- spatial, quantity, category, temporal, relational, or structural relations;
- dense or sparse data sets;
- number of dimensions;
- available similarity or distance metrics;
- available intuitive graphical representations (e.g. temperature with colour);
- number of clusters that can be built, and how the data is spread over them.
The domain model of MIAMM is the main source of this information. It models the domain of music titles utilizing some of the MPEG-7 data categories. The description of the model stores the applicable data characterization for each information type. Additional information, for instance about how many possible values there are for an attribute, also has to be examined. This can be used, e.g., for clustering a data set. The visualization metaphors, too, have to be reviewed in order to get information about their use with the various characteristics, which thus define the requirements for a visualization metaphor. Requirements strongly depend on the virtual object a visualization denotes. As an example, the virtual prototype of the “conveyor belt” metaphor shown above can display about ten items, so the list should be limited to about 30 items to be manageable for the user on a PDA, while the map visualizes the whole database.
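As a rough illustration of such a characterisation step, the sketch below derives a few of the listed properties for a single attribute of the items to be displayed. It is a simplification of what the domain model provides (the function and field names are invented), but it shows the kind of record that metaphor selection can work with.

from collections import Counter
from numbers import Number

def characterise(values):
    """Derive a few of the characteristics listed above for one data dimension
    (one attribute of the items to be displayed)."""
    non_missing = [v for v in values if v is not None]
    numeric = all(isinstance(v, Number) for v in non_missing)
    distinct = Counter(non_missing)
    return {
        "size": len(values),
        "value_type": "numeric" if numeric else "symbolic",
        "ordered": numeric,                        # crude proxy: numbers are orderable
        "discrete": len(distinct) < len(non_missing),   # repeated values => categorical
        "n_clusters": len(distinct),               # one cluster per distinct value
        "cluster_balance": (min(distinct.values()) / max(distinct.values()))
                           if distinct else 0.0,   # 1.0 = data evenly spread
    }

# Example: characterising the 'genre' attribute of a result set
print(characterise(["rock", "rock", "pop", "jazz", "rock", "pop"]))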
3.3 Planning the Presentation and the Interaction
When a new presentation task is received from the dialogue manager, it has to be planned how the content data will be displayed and how the user will interact with the visualization using the haptic buttons. This planning process consists of the following steps:
1 The incoming data is analysed with respect to the characteristics stored in the domain model. The size of the given data set is also an important characteristic, as some visualization metaphors are preferable for small data sets, as shown in the example above. It has to be examined whether the data can be clustered with respect to the different attributes of the items. To estimate how useful the different kinds of cluster building are, the number and size of the clusters is important. For instance, a handful of clusters with the data nearly equally spread between them can give a good overview of the presented information.
2 A mapping is sought between the characteristics of the data and the requirements of the visualization metaphors. A kind of constraint solver processes this data in several steps (a simple matching scheme along these lines is sketched at the end of this section).
(a) The necessary characteristics and requirements are processed first. They are formulated as constraints in advance, as they only depend on the non-dynamic part of the visualization metaphors (see above: “data characteristics”).
(b) Strongly recommended information – if available – is added. This could be user preferences or information serving the coherence of the dialogue. One example is to use the same visualization metaphor for the same kind of data.
(c) If there are additional preferences like button assignment – e.g., using the index finger for marking and not the thumb – they are processed in the last step.
In addition to the selection of a metaphor, a list of configurations and meta information is computed which will be used for further initialising the visualization. The content data is then reformulated with respect to the selected visualization metaphor, including the additional information, and provided to the following sub-module. The next step in the processing is the visualization/rendering module, which computes a visualization from a graphics library of metaphors and fills in the configuration data containing the content (‘what to show’) and the layout (‘how to show’), including the layout of the icons on the PDA’s screen. It then initiates the interaction, i.e. it provides the call-backs that map the user’s haptic
input to the visualization routines. When the user presses a button, the tight coupling of graphic elements to the functions processing the response enables immediate visual and haptic-tactile feedback.
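The matching step referred to in the planning process above can be pictured as follows. The requirement values and preference scores are invented for illustration; the real VisHapTac constraint handling is certainly richer, but the filter-then-score structure is the same idea.

# Hard requirements and soft preferences for each metaphor (values illustrative).
METAPHORS = {
    "wheel":    {"max_items": 30,   "needs_ordered": False, "dimensions": 1},
    "timeline": {"max_items": 500,  "needs_ordered": True,  "dimensions": 1},
    "lexicon":  {"max_items": 500,  "needs_ordered": True,  "dimensions": 1},
    "map":      {"max_items": None, "needs_ordered": False, "dimensions": 2},
}

def select_metaphor(characteristics, preferences=None):
    """Pick a visualization metaphor by first filtering on hard constraints,
    then scoring the survivors with soft preferences (e.g. reusing the
    metaphor chosen for the same kind of data in the previous turn)."""
    preferences = preferences or {}
    candidates = []
    for name, req in METAPHORS.items():
        if req["max_items"] is not None and characteristics["size"] > req["max_items"]:
            continue                                  # hard constraint violated
        if req["needs_ordered"] and not characteristics["ordered"]:
            continue
        if req["dimensions"] > characteristics.get("dimensions", 1):
            continue
        score = 0
        if preferences.get("previous_metaphor") == name:
            score += 2                                # dialogue coherence
        if name in preferences.get("user_likes", []):
            score += 1
        candidates.append((score, name))
    # The wheel acts as the fall-back when nothing else fits.
    return max(candidates, default=(0, "wheel"))[1]

print(select_metaphor({"size": 12, "ordered": False, "dimensions": 1}))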
4. Dialogue Management
4.1 Architecture of the Dialogue Manager
The Dialogue Manager (DM) plays a central role within the MIAMM architecture, as it is the module that controls the high-level interaction with the user, as well as the execution of system-internal actions like database access. Its tasks are the mapping of the semantic representations from the interpretation modules onto user intentions, the update of the current dialogue context and task status on the basis of the recognized intentions, the execution of the actions required for the current task (e.g. database queries), and finally the generation of output through the output layers such as speech, graphics and haptic feedback. The DM is required to cope with possibly incomplete, ambiguous or wrong inputs due to linguistic phenomena like anaphora or ellipsis, or to errors in previous layers. Even in these situations the DM should be able to provide an appropriate answer to the user, resolving the ambiguities or initiating a clarification dialogue in the case of errors and misunderstandings. Multimodality poses an additional challenge, as inputs in different modalities, possibly coming asynchronously, have to be grouped and assigned a single semantic value.
Figure 14.7. Functional architecture of the Dialogue Manager.
Based upon these functional requirements, the DM is decomposed into two components (see Figure 14.7): the Multimodal Fusion component (MMF) and the Action Planner (AP). Semantic representations coming from the Speech Interpretation (Spin) and the Visual Haptic Interaction (VisHapTac) modules are first disambiguated and fused by MMF, and then sent to AP. AP computes the system response and sends the required queries to the corresponding modules. Queries to the MIAMM database and to the devices are issued through the domain model (MiaDoMo). The AP also sends presentation tasks to VisHapTac or the MP3 Player, or activates speech prompts. All data flowing between modules, including communication between the DM components, is defined using MMIL, the data interchange format in MIAMM (see Section 5). The underlying motivations for the decoupling of MMF and AP are, first, to account for modularity within the DM design framework so as to enable an integrative architecture, and second, to provide for sequential information flow within the module. This aspect is crucial in multimodal systems, as the system cannot decide on action execution until all unimodal information streams that constitute a single message are fused and interpreted within a unified context. The functionality and design of the dialogue management components are outlined in the next two sections.
4.2 Multimodal Fusion
MMF assimilates information coming through the various modalities and sub-modules into a comprehensive and unambiguous representational framework. Ideally, the output of MMF is free from all kinds of ambiguity, uncertainty and terseness. More specifically, MMF:
- integrates discursive and perceptual information, which at the input level of MMF is encoded using lexical and/or semantic data categories as specified by the MMIL language;
- assigns a unique MMILId each time a new object enters the discourse; this id serves as an identifier for the object within the scope and timeline of the discourse;
- resolves ambiguities and uncertainties at the level of semantics;
- updates the dialogue history, triggered by the user’s utterances and various updates from other modules within the MIAMM architecture.
Effectively, from a functional point of view, the design of MMF can be divided into three mechanisms, which are described in the following subsections.
4.2.1 Interpretation. This is the first step towards the analysis of the semantic representation provided by the Speech and VisHapTac layers, so as to identify semantically significant entities (discursive and perceptual) in the user’s input. These discourse entities serve as potential referents for referring expressions. For example, in the user’s utterance show me the list MMF identifies relational predicates such as /subject/, /object/ etc. and corresponding arguments such as show, the list etc. as semantically significant entities, and these discourse entities are accommodated into the live4 discourse context. Essentially, every information unit within the MMIL semantic representation serves as a cognitive model of an entity5. A typical minimal representation for an entity contains:
- a unique identifier;
- a type category for the entity; the type is derived from a set of generic domains organized as a type hierarchy, which is established in the Domain Model.
We incorporate these representations into a cognitive framework named Reference Domains [Salmon-Alt, 2000], which assimilates and categorizes discursive, perceptual and conceptual (domain) information pertaining to the entities. On the basis of the information content within the structures representing these entities, a reference domain is segmented into zero, one or more partitions. These partitions map access methods to reference domains and are used for uniquely identifying the referents. Usually, the perceptual and discursive prominence of the entities makes it possible to single out a particular entity within a partition. Effectively, these prominence attributes are incorporated by the specific operation of assimilation on the pertinent reference domains. Triggered by discursive cues (e.g. prepositions, conjunctions, quantified negations, arguments of the same predicate), assimilation builds associations (or disassociations) between entities or sets. Assimilation can also be triggered perceptually (e.g. by graphics and haptics triggers). For example, the user can command play this one while haptically selecting an item from the displayed play list. The haptic trigger entails assimilation of the participants of type /tune/ into a single reference domain and modification of its status to /infocus/. In other scenarios, when we have different kinds of visualizations such as the galaxy, perceptual criteria such as proximity and similarity can lead to a grouping of contextual entities. Depending on the type of trigger, an entity or a set can be made prominent, but this does not necessarily lead to a focussed domain (as in the case of conjunctions).
4 The live discourse context refers to a unified representation framework which is a contextual mapping of the user’s recent utterances and the system’s responses.
5 An entity represents an object, event or a state.
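The reference-domain bookkeeping just described can be pictured roughly as follows (our simplification; class and attribute names are invented). A haptic selection partitions the domain by type and puts the selected tune in focus, which is what later allows play this one to be resolved.

class Entity:
    def __init__(self, mmil_id, ent_type):
        self.mmil_id = mmil_id       # unique id assigned by MMF
        self.ent_type = ent_type     # type from the domain-model hierarchy
        self.status = None           # e.g. 'infocus', 'pending'

class ReferenceDomain:
    """A set of candidate referents together with the partitions that give
    access to them (by type, by perceptual grouping, ...)."""
    def __init__(self, entities):
        self.entities = list(entities)
        self.partitions = {}         # label -> subset of entities

    def partition_by_type(self, ent_type, label):
        self.partitions[label] = [e for e in self.entities if e.ent_type == ent_type]
        return self.partitions[label]

def assimilate_haptic_selection(domain, selected_id):
    """Perceptually triggered assimilation: the item selected with the haptic
    buttons is singled out within the /tune/ partition and put in focus."""
    tunes = domain.partition_by_type("tune", "tunes")
    for entity in tunes:
        entity.status = "infocus" if entity.mmil_id == selected_id else "pending"
    return [e for e in tunes if e.status == "infocus"]

# "play this one" accompanied by a haptic selection of tune t2:
domain = ReferenceDomain([Entity("t1", "tune"), Entity("t2", "tune"), Entity("u", "user")])
print([e.mmil_id for e in assimilate_haptic_selection(domain, "t2")])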
4.2.2 Dialogue progress and context processing. For the dialogue to progress smoothly, the reference domains realized from the semantic representations, as outlined in the previous section, must be integrated in a procedural fashion. These mechanisms must reflect the continuity of the dialogue progress and should support inference mechanisms that can be applied to such an integrated framework, so as to achieve the ultimate goal of fusing asynchronous multimodal inputs. Inherently, task-oriented dialogues are characterized by an incremental building process in which, guided by the perceived knowledge and awareness of the system, the user strives to fulfil the requirements necessary for task completion. Indeed, these interactions go beyond simple slot-filling (or menu-based) task requirements. At the level of dialogue progress, we can construe task-oriented dialogues as a composition of several states, named context frames, each of which is constructed incrementally. These states might be realized during a single utterance or can span several dialogue sequences. Dialogues are modelled as a combination of incremental building and discrete transitions between these context frames. This is complementary to the information-state theories prevalent in the literature. Indeed, the idea is to form a content representation in the form of context frames, which have strong localized properties owing to highly correlated content structures, while across several such frames there is not much correlation. The basic constituent units within a context frame representation are:
- a unique identifier, assigned by the MMF;
- the frame type, such as terminal or non-terminal;
- the grounding status of the user’s input, based on the dialogue acts and the feedback report from the AP;
- reference domains at various levels, as described in Section 4.2.1.
4.2.3 Reference resolution and fusion. Reference resolution strategies vary from one referring expression to another in the sense of differing mechanisms for partitioning the particular reference domain. One or more (in case of ambiguity) of these domains in the live context frame is selected and restructured by profiling the referent. The selection is constrained by the requirement of compatibility between the selected contextual domain and the underspecified domain constructed for the expression being evaluated [Salmon-Alt, 2000]. This entails a restructuring mechanism at the level of context frames, named merging [Kumar et al., 2002], where the under-specified reference domains are integrated within the live context frame until the frame acquires the status of a discrete state, in which case it is pushed to the dialogue history.
The dialogue history comprises the following three components, whose updating and retrieval processes are controlled by the MMF:
- Context History: a repository of resolved semantic representations, in the form of sequential context frames.
- Modality History: a repository of modality interactions that could not be integrated into the live context (possibly because of the temporal lead of the modality event). If the modality history stack is not empty, all member frames that lie within some time limit of the live context frame are tried for merging into the context frame. In addition, there are heuristics for deleting frames if they remain unconsumed for a long time and are hence rendered out of context.
- User Preferences: a repository of the user’s preferences built over the course of the current and previous discourses.
In the output produced by MMF, all pending references are resolved (in the worst case, a few potential referents are provided) and the resulting goal representation is passed to the Action Planner.
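A much simplified reading of this dialogue-history machinery is sketched below. The time window and ageing threshold are invented parameters, and the merge is a plain dictionary update (the real MMF merges reference domains); the sketch only illustrates how temporally leading modality events are buffered and later folded into the live context frame or dropped.

import time

class ContextFrame:
    def __init__(self, frame_id, modality, content, timestamp=None):
        self.frame_id = frame_id
        self.modality = modality          # 'speech', 'haptics', ...
        self.content = dict(content)      # semantic material, resolved or not
        self.timestamp = timestamp if timestamp is not None else time.time()

class DialogueHistory:
    """Context history plus a modality-history stack for events that arrived
    ahead of the utterance they belong to."""
    def __init__(self, merge_window=2.0, max_age=10.0):
        self.context_history = []         # discrete, resolved frames
        self.modality_history = []        # not-yet-consumed modality frames
        self.merge_window = merge_window  # seconds around the live frame
        self.max_age = max_age            # heuristic: drop stale frames

    def push_modality_event(self, frame):
        self.modality_history.append(frame)

    def merge_into(self, live_frame):
        """Try to merge pending modality frames into the live context frame."""
        now = live_frame.timestamp
        remaining = []
        for frame in self.modality_history:
            if abs(now - frame.timestamp) <= self.merge_window:
                live_frame.content.update(frame.content)   # simplistic merge
            elif now - frame.timestamp <= self.max_age:
                remaining.append(frame)                    # keep for a later turn
            # else: rendered out of context and silently dropped
        self.modality_history = remaining
        return live_frame

    def close(self, live_frame):
        """Once the frame is a discrete state, push it to the context history."""
        self.context_history.append(live_frame)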
4.3 Action Planner
Task-oriented cooperative dialogues, where both partners collaborate to achieve a common goal, can be viewed as coherent sequences of utterances asking for actions to be performed or introducing new information into the dialogue context. The task of the action planner is to recognize the user’s goal and to trigger the actions required for the achievement of this goal. The triggered actions can be internal, such as database queries and updates of the internal state of the system, or external, like communication with the user. In other words, the action planner is responsible for the control of both the task structure and the interaction structure of the dialogue.
4.3.1 Interaction and task structure. The interaction and task structure are modelled as sequences of communicative games, which may include embedded sub-games. Each of these communicative games consists of two moves, an initiative move (I) and a response move (R), one of them coming from an input channel and the other going to an output channel (from the point of view of the AP). Each application goal, be it a user goal or an internal sub-goal, corresponds to one communicative game. Figure 14.8 shows a fragment of a sample dialogue from the MIAMM domain. On the top level, this interaction consists of a communicative game, a simple “display game” including U1
and S1, and is played by the user6 who makes the request, by the AP, which passes the request to VisHapTac, and by VisHapTac, which displays the desired song list. This game includes an embedded “clarification game” (S1 and U2) and a “query game”, which is played internally by the AP and the domain model.
U1: “Show me music”
S1: “What kind of music are you looking for?”
U2: “I want rock of the 80’s”
S2: (shows the query results as a list)
Figure 14.8. Sample dialogue.
Interactions are thus viewed as joint games played by different agents, including the user and all the modules that directly communicate with the AP. The moves in each game specify the rules to play it. This approach allows the identification and unified modelling of recurrent patterns in interactions.
4.3.2 Interaction and task flow. To initiate the appropriate communicative game that will guide the interaction, the AP first has to recognize the overall goal that motivates the dialogue, i.e. it has to map a semantic representation coming from the MMF to a suitable application goal. These semantic representations include actions to be performed by the system, as well as parameters for these actions. The setting of a goal triggers the initiation of the corresponding communicative game. The subsequent flow of the game is controlled by means of non-linear regression planning with hierarchical decomposition of sub-goals, as used in the SmartKom project [Reithinger et al., 2003; Wahlster, 2005]. Each communicative game is characterized by its preconditions and its intended effects. On the basis of these preconditions and effects the AP looks for a sequence of sub-games that achieves the current goal. For example, a “display game” requires a list of items and has the effect of sending a display request with this list as its parameter to VisHapTac, whereas a “database query game” requires a set of parameters and sends the query to MiaDoMo. If the preconditions are not met, the AP looks for a game that satisfies them. After successful completion of this game, the triggering game is resumed. Communicative games thus specify a partially ordered and non-deterministic sequence of actions that lead to the achievement of a goal. Execution of system actions is interleaved with planning, since we cannot predict the user’s future utterances.
6 The user is here an abstraction over the speech interpretation and visual haptics interaction modules. All inputs reaching the action planner pass through the multimodal fusion component, where they are fused and integrated. The action planner does not know from which input layer the inputs originally came.
This strategy allows the system to react to unexpected user inputs like misunderstandings or changes of goal.
Figure 14.9. A sample communicative game.
Figure 14.9 illustrates the “display game” shown in Figure 14.8, spanning from U1 to S2. The names of the communicative games are written in capitals (DISPLAY, GET-PARAMETERS and DB-QUERY). Each game includes either an initiative-response (IR) pair or one or more embedded games. In this example the top-level game DISPLAY includes an embedded game DB-QUERY, which itself includes two embedded games, GET-PARAMETERS and DB-QUERY. The leaves indicate moves, where I and R say whether the move is an initiative or a response, and the data in brackets defines the channel from/to which the data flows. The arrows connecting communicative games show dependency relations. The label of a connecting arrow indicates the data needed by the mother game that induced the triggering of the sub-game providing this data. The GET-PARAMETERS game sends a presentation task to VisHapTac, asking the user for the needed parameters. Similarly, the DB-QUERY game sends a database query to MiaDoMo. In both cases, the DM waits for the expected answer, as coded in the response part of the game, and provides it to the triggering game for further processing.
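The precondition/effect-driven expansion of communicative games can be illustrated with the small sketch below. The game table and the planning routine are our simplification: the real AP interleaves planning with execution and with unpredictable user input, whereas this sketch expands the whole sequence offline. It only shows how a missing precondition pulls in the sub-game that provides it.

# Each game: the data it needs (preconditions) and the data it produces (effects).
# The parameter names are invented for illustration.
GAMES = {
    "DISPLAY":        {"needs": ["item_list"],  "gives": ["display_done"]},
    "DB-QUERY":       {"needs": ["parameters"], "gives": ["item_list"]},
    "GET-PARAMETERS": {"needs": [],             "gives": ["parameters"]},
}

def plan(goal_game, available, games=GAMES):
    """Expand a goal game into an ordered sequence of sub-games: whenever a
    precondition is not yet available, recursively schedule a game whose
    effects provide it, then resume the triggering game."""
    schedule = []

    def achieve(game):
        for need in games[game]["needs"]:
            if need not in available:
                # Assumes some game in the table provides the missing data.
                provider = next(g for g, spec in games.items() if need in spec["gives"])
                achieve(provider)
        schedule.append(game)
        available.update(games[game]["gives"])

    achieve(goal_game)
    return schedule

# User: "Show me music", with no query parameters known yet:
print(plan("DISPLAY", available=set()))
# -> ['GET-PARAMETERS', 'DB-QUERY', 'DISPLAY']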
5. The Multimodal Interface Language (MMIL)
5.1 Design Framework
The Multimodal Interface Language (MMIL) is the central representation format of the MIAMM software architecture. It defines the format of the data exchanged between the modules of the MIAMM system. It is also the basis for the content of the dialogue history in MIAMM, both from the point of view of the objects being manipulated and of the various events occurring during a dialogue session. The MMIL language is therefore not solely dedicated to the representation of the interaction between the user and the dialogue system, but also to the various interactions occurring within the architecture proper, like, for instance, a query to the domain model. It provides a means to trace the system behaviour, in continuity with what is necessary to trace the man-machine interaction. As a result, the MMIL language contains both generic descriptors related to dialogue management, comprising general interaction concepts used within the system, and domain-specific descriptors related to the multimedia application dealt with in the project. This ambitious objective has consequences for the design of the MMIL language. The language is formulated using XML: schemata describe the admissible syntax of the messages passed through the system. Since the actual XML format is potentially complex and, above all, required some tuning as the design of the whole system went on, we decided not to draft MMIL directly as an XML schema, but to generate this schema through a specification phase, in keeping with the results already obtained in the SALT7 project for terminological data representation, see [Romary, 2001]. We thus specify the various descriptors (or data categories) used in MMIL in an intermediate format expressed in RDF and compatible with ISO 11179, in order to generate both the corresponding schema and the associated documentation, see [Romary, 2002a].
5.2 Levels of Representation – Events and Participants
Given the variety of levels (lexical, semantic, dialogue, etc.) that the MMIL language must be able to represent, it is necessary to have an abstract view of these levels in order to identify some shared notions that can be the basis for the MMIL information architecture. Indeed, it can be observed that most of these levels, including graphical and haptic oriented representations, can be modelled as events, that is, temporal objects that are given a type and may enter a network of temporal relations. These events can also be associated with participants, which are any other objects either acting upon or being affected by the event. For instance, a lexical hypothesis in a word lattice can be seen as an
7 http://www.loria.fr/projets/SALT
event (of the lexical type), which is related to other similar events (or reified dates) by temporal relations (one hypothesis precedes another, etc.) and has at least one participant, namely the speaker, as known by the dialogue system. Events and participants may be accessible in two different ways. They can be part of an information structure transferred from one module to another within the MIAMM architecture, or associated with one given module, so that they can be referred to by any dependency link within the architecture. This mechanism of registration allows for factorisation within the MIAMM architecture and thus for lighter information structures being transferred between modules. Two types of properties describe events and participants:
- Restrictions, which express either the type of the object being described or some more refined unary property of the corresponding object;
- Dependencies, which are typed relations linking two events or an event to one of its participants.
From a technical point of view, dependencies can be expressed, when possible, by simple references within the same representation, but also by an external reference to an information structure registered within the architecture.
5.3 Meta-Model
From a data model point of view, the MMIL structure is based on a flat representation that combines any number of two types of entities that represent the basic ontology of MIAMM, namely events and participants. An event is any temporal entity either expressed by the user or occurring in the course of the dialogue. As such, this notion covers interaction events (spoken or realized through the haptic interface), events resulting from the interpretation of multimodal inputs, and events generated by decision components within the dialogue system. For instance, this allows us to represent the output of the action planner by means of such an event. Events can be recursively decomposed into sub-events. A participant is any individual or set of individuals about which the user says something or the dialogue system knows something. Typical individuals in the MIAMM environment are the user, multimedia objects and graphical objects. Participants can be recursively decomposed into sub-participants, for instance to represent sets or sequences of objects. Events and participants cover all the possible entities that the MIAMM architecture manipulates. They are further described by means of various descriptors, which can either give more precise information about them (restrictions) or relate events and participants with one another (dependencies). Both types of descriptors are defined in MMIL as Data Categories, but dependencies are given a specific status by being mostly implemented as elements
attached to the encompassing MMIL structure. Dependencies can express any link that can exist between two participants (e.g. a part-whole relation), two events (temporal order), or between a participant and an event (“participants” of a predicate). Events and participants can be iterated in the MMIL structure, which leads to the meta-model schematised in Figure 14.10 using the UML formalism. In addition, the representation shows an extra level for the temporal information associated with events.
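An informal reading of this meta-model in code might look as follows. The field names echo the examples in this section (evtType, objType, dimension, and so on), but the classes are our own sketch, not the normative MMIL schema.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Participant:
    pid: str
    restrictions: Dict[str, str] = field(default_factory=dict)   # e.g. objType=tune
    sub_participants: List["Participant"] = field(default_factory=list)

@dataclass
class Event:
    eid: str
    restrictions: Dict[str, str] = field(default_factory=dict)   # e.g. evtType=play
    temporal: Optional[Dict[str, str]] = None                    # positioning/duration
    sub_events: List["Event"] = field(default_factory=list)

@dataclass
class Dependency:
    dep_type: str        # e.g. 'object', 'propContent', 'similar'
    source: str          # id of an event or participant
    target: str          # id of an event or participant (possibly registered elsewhere)
    refinements: Dict[str, str] = field(default_factory=dict)    # e.g. dimension=genre

@dataclass
class MMILComponent:
    events: List[Event] = field(default_factory=list)
    participants: List[Participant] = field(default_factory=list)
    dependencies: List[Dependency] = field(default_factory=list)

# "play the song": a play event depending on a definite participant of type tune.
component = MMILComponent(
    events=[Event("e1", {"evtType": "play", "mode": "imperative"})],
    participants=[Participant("p0", {"objType": "tune", "refType": "definite"})],
    dependencies=[Dependency("object", "e1", "p0")],
)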
Figure 14.10. UML diagram representing the MMIL information meta-model.
5.4 Data Categories
Data category specifications are needed to identify the set of information units that can be used as restrictions and dependencies on instantiations of nodes from the meta-model. The following types of data categories are incorporated within the MMIL specifications:
- Data categories describing both events and participants: general information about events or participants, such as identifiers, lexical value, attentional states, and ambiguities;
- Data categories for events: information pertaining to certain types of system-known events and the functional aspect of the user’s expressions;
- Data categories for participants: information exclusive to participants, such as generic types and other related attributes;
- Data categories for time-level information: temporal positioning and duration of an event;
- Relations between events and participants: relation mappings between events and participants, using the knowledge available at a certain stage of processing, such as /object/, /subject/, etc.;
- Relations between events: propositional aspects and temporal relations among events, such as /propContent/, etc.;
- Relations between participants: e.g., similarity relationships (see below).
5.5 Sample Illustration
Given those preliminary specifications, the representation of the semantic content of a simple sentence like “play the song” would be as follows:
<mmilComponent>
  <event id="e0">
    <evtType>speak request
    <speaker target="User"/>
  <event id="e1">
    <evtType>play
    <mode>imperative Present
  <participant id="p0">
    singular tune definite pending
  <participant id="User">
    User 1PPDeixis pending
As can be seen above, it is possible to mix information percolating from lower levels of analysis (like tense and aspect information) with more
semantic and/or pragmatic information (like the referential status of the participant). Kumar and Romary [2003] illustrate and examine this representational framework against typical multimodal representation requirements such as expressiveness, semantic adequacy, openness, uniformity and extensibility.
5.6 Additional Mechanisms
The sample illustration is very simple and is clearly not exhaustive or flexible enough for true multimodal interactions. Essentially, the MMIL design framework allows for certain additional mechanisms (see [Romary, 2002b] for details) which impart sufficient representational richness and integration flexibility for any kind of multimodal design:
- alternatives and ranges;
- temporal positioning and duration;
- refinements of data categories.
As specified in ISO 16642, it is possible, when needed, to refine a given data category by means of additional descriptors. Consider, e.g., that a similarity query is expressed by a /similar/ relation between two participants as follows:
<mmilComponent>
  ...
  <participant id="id1"> ...
  <participant id="id2"> ...
  ...
<mmilComponent>
  ...
  <participant id="id1"> ...
  <participant id="id2"> ...
  genre author
  ...
It is possible to express more precisely the set of dimensions along which the similarity search is to be made, as illustrated immediately above.
6. Conclusion
The main objective of the MIAMM project was the development of new concepts and techniques for user interfaces employing graphics, haptics and speech to allow fast navigation in, and easy access to, large amounts of data. This goal poses interesting challenges as to how the information and its structure can be characterized by means of visual and haptic features. Furthermore, it had to be defined how the different modalities can be combined to provide a natural interaction between the user and the system, and how the information from multimodal sources can be represented in a unified language for information exchange inside the system. The final MIAMM system combines speech with new techniques for haptic interaction and data visualization to facilitate access to multimedia databases on small handheld devices [Pecourt and Reithinger, 2004]. Interaction is possible in all three target languages, German, French, and English. The final evaluation of the system supports our initial hypothesis that users prefer language to select information and haptics to navigate in the search space. The interaction proved to be intuitive in the user walkthrough evaluation [van Esch-Bussemakers and Cremers, 2004]. Nevertheless, there are still open questions, and further research is needed to exhaust the possibilities that multimodal interfaces using haptics offer. This includes the conception of new visualization metaphors and their combination with haptic and tactile features, as well as the modelling and structuring of the data to take advantage of the expressivity of these modalities. The results of these investigations can provide interesting insights that help to cope with the constant growth of available information resources and the difficulty of visualizing and accessing them.
References
Beyer, L. and Weiss, T. (2001). Elementareinheiten des Somatosensorischen Systems als Physiologische Basis der Taktil-Haptischen Wahrnehmung. In Grunewald, M. and Beyer, L., editors, Der Bewegte Sinn, pages 25–38. Birkhäuser Verlag, Basel.
Card, S. K., Mackinlay, J. D., and Shneiderman, B., editors (1999). Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers Inc.
Fedeler, D. and Lauer, C. (2002). Technical Foundation of the Graphical Interface. Technical report, DFKI, Saarbrücken, Germany. Project MIAMM – Multidimensional Information Access using Multiple Modalities, EU project IST-2000-29487, Deliverable D2.2.
Gärdenfors, P. (2000). Conceptual Spaces: The Geometry of Thought. MIT Press, USA.
Kumar, A., Pecourt, E., and Romary, L. (2002). Dialogue Module Technical Specification. Technical report, LORIA, Nancy, France. Project MIAMM – Multidimensional Information Access using Multiple Modalities, EU project IST-2000-29487, Deliverable D5.1.
Kumar, A. and Romary, L. (2003). A Comprehensive Framework for Multi-Modal Meaning Representation. In Proceedings of the Fifth International Workshop on Computational Semantics, Tilburg, Netherlands.
Maybury, M. T. and Wahlster, W., editors (1998). Readings in Intelligent User Interfaces. Morgan Kaufmann Publishers Inc.
Pecourt, E. and Reithinger, N. (2004). Multimodal Database Access on Handheld Devices. In The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 206–209, Barcelona, Spain. Association for Computational Linguistics.
Reithinger, N., Alexandersson, J., Becker, T., Blocher, A., Engel, R., Löckelt, M., Müller, J., Pfleger, N., Poller, P., Streit, M., and Tschernomas, V. (2003). SmartKom: Adaptive and Flexible Multimodal Access to Multiple Applications. In Proceedings of the Fifth International Conference on Multimodal Interfaces (ICMI), pages 101–108, Vancouver, Canada. ACM Press.
Romary, L. (2001). Towards an Abstract Representation of Terminological Data Collections - the TMF Model. In Proceedings of Terminology in Advanced Microcomputer Applications (TAMA), Antwerp, Belgium.
Romary, L. (2002a). MMIL Requirements Specification. Technical report, LORIA, Nancy, France. Project MIAMM – Multidimensional Information Access using Multiple Modalities, EU project IST-2000-29487, Deliverable D6.1.
Romary, L. (2002b). MMIL Technical Specification. Technical report, LORIA, Nancy, France. Project MIAMM – Multidimensional Information Access using Multiple Modalities, EU project IST-2000-29487, Deliverable D6.3.
Salmon-Alt, S. (2000). Reference Resolution within the Framework of Cognitive Grammar. In Proceedings of the International Colloquium on Cognitive Science, San Sebastian, Spain.
van Esch-Bussemakers, M. P. and Cremers, A. H. M. (2004). User Walkthrough of Multimodal Access to Multidimensional Databases. In Proceedings of the Sixth International Conference on Multimodal Interfaces (ICMI), pages 220–226, State College, PA, USA. ACM Press.
Wahlster, W., editor (2005). SmartKom: Foundations of Multimodal Dialogue Systems. Springer, Berlin.
Chapter 15 ADAPTIVE HUMAN-COMPUTER DIALOGUE Sorin Dusan and James Flanagan Center for Advanced Information Processing Rutgers University, USA
{sdusan, jlf}@caip.rutgers.edu
Abstract
It is difficult for a developer to account for all the surface linguistic forms that users might need in a spoken dialogue computer application. In any specific case, users might need additional concepts not pre-programmed by the developer. This chapter presents a method by which end-users can adapt the vocabulary of a spoken dialogue interface at run-time. The adaptation is based on expanding existing pre-programmed concept classes by adding new concepts to these classes. This adaptation is classified as a supervised learning method in which users are responsible for indicating the concept class and the semantic representation for the new concepts. This is achieved by providing users with a number of rules and ways in which the new language knowledge can be supplied to the computer. Acquisition of new linguistic knowledge at the surface and semantic levels is done using multiple modalities, including speaking, typing, pointing, touching or image capturing. Language knowledge is updated and stored in a semantic grammar and a semantic database.
Keywords:
Adaptive dialogue, dialogue systems, language acquisition, language understanding, multimodal interaction, grammar.
1. Introduction
Among life forms, humans have a unique characteristic: the ability to learn and use language. This ability makes communication among humans natural, efficient and intelligent. Human interaction with computers, relying on keyboard, display, and mouse, does not reach a comparable level of naturalness and intelligence. And since computers have become omnipresent in our lives, a need has emerged to make human-computer interaction more natural, efficient, and intelligent. One way to achieve this goal is to make human-computer communication resemble human-human communication by using spoken
language and manual gesture. Another way is to create computer agents which have artificial intelligence capable of learning and adapting to new events and situations, including new linguistic knowledge. Although advances in spoken language processing have enabled today’s computers to speak and recognize spoken utterances, these capabilities are in general limited to restricted domains. Moreover, the pre-programmed coverage of a specific task domain rarely includes all the possible surface linguistic forms a user might want to use in accomplishing a task within that domain. It is, as yet, impossible to program computers to speak and understand human language as people do. A current research interest therefore is to create computer systems capable of learning human language and of acquiring related knowledge as an interaction proceeds. The work presented in this chapter focuses on creating computer interfaces and agents capable of acquiring new linguistic-semantic knowledge and of adapting their vocabularies, grammars and semantics by learning from users who employ multiple input modalities. Although initial attempts at building such systems appeared more than a decade ago, this area of research in spoken language processing is now attracting increased theoretical and practical interest and is emerging as a distinct research field. This chapter proceeds as follows: Section 2 offers an overview of computer language acquisition. Dialogue systems are discussed in Section 3, with a focus on spoken dialogue systems and on multimodal dialogue. Section 4 describes our method for representing language knowledge in the computer in two separate databases: (i) a grammar, for storing the surface linguistic information, and (ii) a semantic database, for storing the semantic representation. Section 5 presents our dialogue adaptation approach for acquiring new language knowledge and adapting the vocabulary of the dialogue. Multimodal interaction is central throughout this work. Preliminary experiments using language acquisition and dialogue adaptation are summarized in Section 6. A summary and conclusions are presented in Section 7, along with comments relating to future advances.
2. Overview of Language Acquisition
Adapting the dialogue of a human-computer interface implies language acquisition. Acquisition of language knowledge by machines is still in an early stage and is mostly demonstrated in restricted laboratory prototypes. It can be achieved in different ways and at various levels of pre-programmed language knowledge, and these are strongly dependent upon the type of application targeted. A short overview of various approaches to language acquisition is presented in the following.
Among the first approaches to language acquisition was the one by Gorin et al. [1991], who pursued experiments for automated call routing based on text input. This approach constructed connectionist networks between words and meaningful machine actions. Information-theoretic networks were built in which the weights were based on mutual information. For example, if the word collect occurred many times in requests for collect call charges, the mutual information became high and was used to update the network weights linking this word with the computer action switch to collect-call. An extension of this work was made later by applying spoken input to the machine instead of typed text [Gorin et al., 1994]. In these experiments the semantics were fixed and were represented by a list of machine actions, resulting in single-layer information-theoretic networks. These experimental studies were later developed into a system for automatically routing telephone calls using the caller's natural spoken enquiry [Gorin et al., 1997]. A structured network, represented by a two-dimensional product network, was evaluated in a database retrieval experiment called Almanac [Miller and Gorin, 1993]. In this experiment the machine was able to answer questions about 20 attributes related to each of the 50 states in the U.S.A. An example of such an enquiry is What is the capital of New Jersey? The input was applied independently to both state and attribute networks and the outputs were combined in an outer sum, selecting, out of the 1000 facts, the one with the highest sum. Again, the machine used fixed, pre-programmed actions (1000 facts) and only acquired new words, which were associated probabilistically with these machine actions. However, in all these experimental studies the human-machine dialogue was unconstrained and was limited, in the speech-input cases, only by the vocabulary of the speech recognizers used.

An experimental study on language acquisition based on typed text and visual information was published by Sankar and Gorin [1993]. These authors used a simulation of the blocks world displayed on a computer screen, and the computer acquired associations between words and different objects and colours. Thus, after some training, the computer was able to locate on the display the position of objects based on their corresponding colour and shape names. These attributes were trained in two different sensory primitive subnetworks. This work was later extended in an experimental study of robot language acquisition with six sensory input channels by Henis et al. [1994]. Gavalda and Waibel [1998] created a tool that allowed non-expert end-users to expand a semantic grammar at run-time. This tool consisted of two stages: an authoring stage in which the developer created a domain model and a kernel grammar, and a run-time stage in which new written sentences were automatically parsed and the end-user was asked to teach their meaning. The domain model was considered fixed after the authoring stage and thus the end-users could only expand the surface expressions within the given semantic domain,
by adding new rules to the semantic grammar. Three methods of learning were employed for the acquisition of the semantic mappings of the new expressions: parser predictions, hidden understanding model and end-user paraphrases. In the reported experiments, two non-expert users were asked to teach the system the meaning of 72 new sentences. In these interactions the system acquired 8 new rules in 30 minutes from the first user and 17 new rules in 35 minutes from the second user. These results were compared with the results obtained by an expert grammar writer who developed the semantic grammar using the same 72 new sentences based on a traditional method. The expert added 8 new rules in 15 minutes.

A computational model for cross-channel early lexical learning (CELL) in infants was published by Roy [1999]. In this study, words describing objects' shapes and colours were acquired by the system when it discovered high mutual information due to the simultaneous occurrences of these words, captured by a microphone, and the corresponding primitive semantics, captured by video cameras. The underlying idea in this method is similar to the connectionist approach of Gorin and his colleagues, with the main difference that here the semantic categories are not pre-programmed, but rather are acquired in conjunction with the new linguistic units. Thus the network's weights are modelled by the discovered mutual information between words and visually grounded semantic categories, both being discovered and detected by the system. Another distinctive feature is that here the audio and visual inputs are not pre-segmented by users of the system into units corresponding to words and their semantics. The segmentation is discovered by the system from raw audio and visual inputs. However, like most of the previous approaches, this study did not deal with the acquisition of syntax or grammar. A similar study, focusing on the discovery of useful linguistic-semantic structures from raw sensory data, was published by Oates [2001]. The goal was to enable a robot to discover associations between words and different semantic representations obtained from a video camera.

We previously introduced an interactive method for multimodal language acquisition [Dusan and Flanagan, 2001]. In this method the user provides the computer with both syntactic and semantic information for the newly acquired lexical units. Acquisition of syntactic information corresponding to new words or phrases is based on user-provided classification of these new linguistic units. Orthography for the new units is obtained by automatic speech recognition or by typing. The corresponding semantic representation can be acquired by the computer for each new linguistic unit through multiple input modalities, such as speaking, typing, pointing, drawing or image capturing. This initial study was later extended as an application for adaptive dialogue in human-computer communication [Dusan and Flanagan, 2002a; Dusan and Flanagan, 2002b]. The main idea underlying this research is that the dialogue system can
be adapted at run-time without intervention of the developer. The developer can account neither for all surface realizations of a given meaning, nor for all semantic representations that different users might need in a specific application. Thus the developer can provide a core of language knowledge which can later be expanded and personalized by individual end-users. Details of this approach, guidelines for implementing an adaptive dialogue system, and a summary of several preliminary experiments will be provided in the next sections.
3. Dialogue Systems
A dialogue system is a machine that can use human language and engage in a dialogue with users in order to accomplish certain tasks [Sadek and de Mori, 1998]. In order to be efficient and to resemble human-human dialogue the system must have some abilities of negotiation, context interpretation, interaction and language flexibility. In general, a dialogue with the machine is a sequential process and contains multiple parts or turns. These turns can be initiated by machine, by user, or both (mixed initiative). Usually the dialogue is restricted to a small domain, characteristic of a particular application. Dialogue systems can be built using spoken or written natural language. In this chapter we are concerned primarily with spoken dialogue systems.
3.1 Spoken Dialogue Systems
Implementation of spoken dialogue systems (SDS) is possible due to advances in automatic speech recognition (ASR), text-to-speech (TTS) and natural language understanding (NLU) technologies. A general block diagram of an SDS, either implemented on a desk-top computer, on a PDA, or on a remote information system accessed by telephone, is represented in Figure 15.1. The system consists of four main modules: a speech understanding module, a system belief and state module, a dialogue manager, and a speech generation module. The speech understanding module transforms the user’s speech input SU into user dialogue acts AU which can be represented by a frame of information regarding the user’s intention [Young, 2002]. This module contains a chain of blocks in which an automatic speech recognition (ASR) transforms the speech signal SU into a text string of words, which then are transformed by a semantic decoding block into a string of concepts, which are used by a dialogue act detector to estimate user’s acts AU . The estimated user dialogue acts are then applied simultaneously to the system belief & state module and to the dialogue manager module. The system’s state S and belief B can thus be changed by the incoming user dialogue acts AU . In some SDS the current state S and belief B can be used to estimate the new user dialogue acts AU by the speech understanding module. The user dialogue acts AU , and system state S and belief B are then used by the dialogue manager to estimate the new
system dialogue acts AS. The system dialogue acts are then transformed into synthetic output speech SS by the speech generation module.

Figure 15.1. General architecture for a spoken dialogue system.
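Purely as an illustration of how the four modules and the quantities SU, AU, B, S, AS and SS fit together, the pipeline of Figure 15.1 could be sketched as a set of interfaces. The type and method names below are our own and do not come from any particular SDS implementation.

```java
/** Illustrative interfaces for the four modules of Figure 15.1 (names are ours). */
interface SpeechUnderstanding {
    /** Maps the user's speech S_U to an estimated user dialogue act A_U. */
    DialogueAct understand(AudioSignal userSpeech, BeliefState currentBeliefAndState);
}

interface BeliefAndStateTracker {
    /** Updates the system belief B and state S from the estimated user dialogue act. */
    BeliefState update(BeliefState previous, DialogueAct userAct);
}

interface DialogueManager {
    /** Chooses the next system dialogue act A_S from A_U together with B and S. */
    DialogueAct decide(DialogueAct userAct, BeliefState beliefAndState);
}

interface SpeechGeneration {
    /** Renders the system dialogue act A_S as synthetic output speech S_S. */
    AudioSignal generate(DialogueAct systemAct);
}

/** Placeholder types standing in for richer representations. */
class AudioSignal { }
class DialogueAct { }
class BeliefState { }
```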
The speech understanding module usually contains a grammar and a semantic parser which are used to identify semantic concepts from the user’s utterances. The grammar stores the recognized and understood sentences and language rules. In general this grammar is fixed and is pre-programmed by the developer. The system’s belief B emerges from the system’s knowledge. Adapting the dialogue at run-time requires in general changing both the grammar and the system’s knowledge. The new words are stored in the grammar and their semantic representation in a separate database.
3.2 Multimodal Dialogue
One way of achieving a more natural human-computer interaction is by employing multiple input-output modalities based on sight, sound and touch. These additional modalities can complement and supplement the traditional computer interfaces based on keyboard, display and mouse. Speech is by far the most natural means of human communication. Fusing the user's gestures with the speech modality makes the dialogue more efficient. A very early attempt to combine speech and pointing was made by Bolt [1980]. He proposed to resolve deictic utterances containing terms like that or there by combining the pointing information with speech. Hand gestures can be incorporated by using a light pen on the screen, a stylus with a pen tablet, or a video camera. Eye gestures can be integrated by employing a gaze tracker. Depending on the domain and environment, some modalities are more or less appropriate for particular applications.
For example, in a desktop-computer application, speech, keyboard, pen and mouse are primarily used, whereas in a PDA application speech and pen prevail. Simultaneous use of multiple modalities can make interactions more efficient by helping to resolve ambiguities arising from imprecise, incomplete or anaphoric utterances. Efficiency studies have shown that, by integrating speech and pen input, one can accomplish tasks approximately 3 to 8 times faster than when using a GUI alone [Cohen et al., 1998]. Consider an application in which the user creates and manipulates different objects or icons on the screen using speech and hand gestures (pen or mouse). In such cases, users can move objects on the screen more efficiently by saying, for example, 'Move this here' and pointing first to a position that identifies the object referred to by this, and then to a position specifying the target location referred to by here. Figure 15.2 shows an example in which speech and pointing are fused to achieve this efficiency [Dusan and Flanagan, 2001]. After recognizing the spoken utterance, the time stamps of the words this and here are used to backtrack the mouse positions at the moments when these words occurred and so to identify the corresponding object (this) and the target position (here).
Figure 15.2. Speech and pointing in an utterance 'Move this here.'
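A minimal sketch of this time-stamp fusion follows, under the assumption that the recognizer reports a start time for each word and that mouse positions are logged continuously; the class and method names are illustrative, not taken from the system described in the text.

```java
import java.util.Comparator;
import java.util.List;

/** Resolves deictic words such as "this" and "here" against logged mouse positions. */
public class DeicticResolver {

    public record PointerSample(long timeMillis, int x, int y) { }
    public record RecognizedWord(String text, long startMillis) { }

    private final List<PointerSample> pointerLog;

    public DeicticResolver(List<PointerSample> pointerLog) {
        this.pointerLog = pointerLog;
    }

    /** Returns the logged pointer sample closest in time to the spoken word. */
    public PointerSample resolve(RecognizedWord word) {
        return pointerLog.stream()
                .min(Comparator.comparingLong(
                        (PointerSample s) -> Math.abs(s.timeMillis() - word.startMillis())))
                .orElseThrow(() -> new IllegalStateException("no pointer data logged"));
    }

    public static void main(String[] args) {
        var log = List.of(new PointerSample(1000, 120, 80), new PointerSample(2500, 400, 300));
        var resolver = new DeicticResolver(log);
        // "Move this here": "this" spoken around t = 1100 ms, "here" around t = 2600 ms.
        System.out.println(resolver.resolve(new RecognizedWord("this", 1100))); // near (120, 80)
        System.out.println(resolver.resolve(new RecognizedWord("here", 2600))); // near (400, 300)
    }
}
```

In a real system the backtracking would select from a richer event log (object hits, palette regions, and so on), but the nearest-in-time principle is the same.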
4. Language Knowledge Representation
In computer applications language knowledge is commonly structured at two main levels: a surface level and a deep level. At the surface level, language is represented by knowledge of the vocabulary, syntax and grammar. At the deep level, language is represented by semantics or the meanings of various surface form realizations. Based on this classification, we developed a computer dialogue system in which the language knowledge is stored in two databases: a grammar that retains the surface language knowledge, and a semantic database that encapsulates the corresponding meanings.
4.1 Grammar
A grammar is a specification of the allowed sentence structures in a language. A rule grammar defines the allowed sentence structures by a set of production rules. A context-free grammar (CFG) consists of a single start symbol and a set of production rules in which the left-hand sides are single non-terminals. A semantic grammar is a special form of rule grammar [de Mori, 1999] in which the non-terminal symbols represent semantic classes of concepts, such as colours, fruits and geometric shapes, and the terminal symbols represent concept words such as yellow, apple and rectangle. A common approach in semantic decoding is to tag each member of a semantic class (concept class) with a concept label. These tags or concept labels can be written as words included in curly brackets. For example, the word blue can have attached the tag {blue}. Function words or irrelevant words can be left untagged, and thus an utterance can be transformed into a string of tags that simplifies the semantic interpretation. An utterance such as 'Select the blue colour' can be decoded into the concept label string '{select} {blue}'. The grammar needs to contain a kernel of surface language knowledge with enough rules and classes of concepts to cover most of the surface realizations for an application in a specific domain. An example of such a rule is 'Select {select} [the] <colours> [colour]'. The word colours included between angle brackets represents a non-terminal symbol (a semantic class), and the words included between square brackets are optional in users' utterances. It is difficult to ensure 100% coverage of these surface forms for a specific application; therefore only lower coverage values can be expected when hand-crafting these rules in reasonable time. In addition to the required rules, the grammar should contain enough concept classes to support the application. Each of these concept classes needs to contain only a few pre-programmed examples of concepts, usually words or phrases belonging to various syntactic classes. We call a non-terminal symbol in the grammar that represents a semantic class a concept class. The grammar rules incorporate the syntactic information necessary for a dialogue in the application's limited
domain. Thus, the hand-crafted grammar rules have a very important role in covering the domain of the application and they require special attention from the developer.
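As an illustration only, a kernel of this kind might look as follows in the Java Speech Grammar Format that the chapter's implementation uses (Section 6); the grammar name, the particular rules and the concept-class members are our own invented examples, not the actual grammar of the system.

```
#JSGF V1.0;
grammar graphicsKernel;

// Concept classes (non-terminal symbols); each holds a few pre-programmed
// concepts and is expanded at run time as users teach new members.
<colours> = red {red} | green {green} | blue {blue};
<shapes>  = circle {circle} | rectangle {rectangle} | triangle {triangle};

// Hand-crafted task rules covering the application domain.
public <select> = select {select} [the] <colours> [colour];
public <create> = create {create} [a] <shapes>;

// Special acquisition rule of the kind described in Section 5: a word returned
// by the large-vocabulary dictation recognizer but unknown to the parser can be
// classified by the user, e.g. "Pink is this colour" while pointing at a palette.
public <teachColour> = <colours> is this colour {learn_colour};
```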
4.2 Semantic Database
Interpretation of utterances for performing the necessary computer actions is based on semantic knowledge stored in the semantic database. This database contains a set of objects, each containing a representation of meaning for a concept stored in a class in the grammar. In developing an adaptive dialogue system, a class is constructed in the semantic database for each of the concept classes included in the grammar. A number of objects, which we call here semantic objects, are then built from each class, each object corresponding to a concept stored in the corresponding concept class of the grammar. At run-time users can automatically build additional objects in these classes and store them in the semantic database. The semantic objects are created using object-oriented programming and they are instances of the classes in the semantic database. The semantic representation stored in such an object defines the computer's knowledge and interpretation of the corresponding concept. For example, the semantic object corresponding to the concept blue, included in the semantic class colours, has the semantic representation defined by the RGB attributes (0, 0, 255). These attributes represent the computer's semantic representation of the concept blue. The semantic object for the concept square contains a pointer to the class regular polygon and an attribute equal to 4 representing the number of sides. All the characteristics of a regular polygon are thus inherited by the semantic object square. A semantic object for a drawing is represented by a starting point and a set of lines, each being described by the x and y coordinates of the cursor on the screen. The semantic representation necessary to build these objects is either pre-programmed or acquired from the user through multiple input modalities.
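For illustration only, semantic objects of the kind just described might be realized as small Java classes along the following lines; the class names and fields are ours, chosen to mirror the blue, square and drawing examples in the text rather than the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative semantic objects mirroring the blue, square and drawing examples. */
class ColourConcept {
    final String name;
    final int red, green, blue;                 // e.g. blue = (0, 0, 255)

    ColourConcept(String name, int red, int green, int blue) {
        this.name = name;
        this.red = red; this.green = green; this.blue = blue;
    }
}

class RegularPolygonConcept {
    final String name;
    final int sides;                            // square = 4; other properties are shared
                                                // with the general regular-polygon class
    RegularPolygonConcept(String name, int sides) {
        this.name = name;
        this.sides = sides;
    }
}

class DrawingConcept {
    final int startX, startY;                   // starting point on the screen
    final int[][] lines;                        // each line as {x1, y1, x2, y2} coordinates

    DrawingConcept(int startX, int startY, int[][] lines) {
        this.startX = startX; this.startY = startY;
        this.lines = lines;
    }
}

/** The semantic database: one entry per concept; taught objects are added at run time. */
class SemanticDatabase {
    private final Map<String, Object> objects = new HashMap<>();

    void store(String conceptName, Object semanticObject) {
        objects.put(conceptName, semanticObject);
    }

    Object lookup(String conceptName) {
        return objects.get(conceptName);
    }
}
```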
5. Dialogue Adaptation
Before presenting our method of adaptive human-computer dialogue, let us consider some theories of language and language acquisition by humans. One of the prominent theories is that of Chomsky [1986], in which the faculty of language represents a distinct system of the human mind and evolves through experience and training from an initial state S0 to a relatively stable steady state SS. In the latter state, the language faculty is already structured and it supports only peripheral modifications such as acquiring new vocabulary items. This theory focuses on formal structural properties of the language faculty, whereas other theories concentrate on the semantic properties of this faculty.
An interesting aspect of language acquisition is whether the child has an active role or a passive role [de Villiers and de Villiers, 1978]. A theory emphasizing the active role suggests that the child does not learn language but creates it. At the other extreme, another theory emphasizes the parents' main role in teaching the child language, and hence the child has only a passive role in acquiring structure. These insights into human language development and acquisition provide some basic guidelines for our approach to adaptive dialogue systems.
5.1 System
The proposed adaptive human-computer dialogue system is supported by a computer with a multimodal interface based on a microphone and loudspeakers for speech input and output, a keyboard for typing, a mouse for pointing, a pen tablet for drawing and handwriting, a CCD camera for image capturing, and a display for graphics and text output. A detailed block diagram of the system is given in Figure 15.3.

Figure 15.3. Block diagram of the adaptive dialogue system.

The system contains an automatic speech recognition (ASR) engine that converts spoken utterances into text strings. The ASR is capable of recognizing unconstrained speech using a very large vocabulary of words. The text strings produced by the ASR are then passed to a parser within a Language Understanding and Adaptation block.
The parser analyzes the text strings by comparing them with the allowed sentences stored in the grammar. When the text string matches one of the rules stored in the grammar, the parser sends the utterance, including a string of tags (concept labels), to a semantic decoding block for interpretation. In order to interpret the user's utterances properly, this block uses information from three sources: first, information from the semantic database, corresponding to each concept; second, information from the Dialogue Manager block (dialogue history), in order to infer the context; and third, information from the System Belief and State block. The output of the semantic decoding is either communicative, and sent to the dialogue manager, or non-communicative, and sent to the system belief and state block, or both.

When the user's utterances contain unknown concepts, the parser cannot match these utterances to any of the rules stored in the grammar, and it then proceeds to identify the unknown words or phrases in the corresponding text string. Once new concepts are detected they are stored in the New Concepts block, which issues a signal to the Dialogue Manager to let the user know that these terms are unknown to the system. The Dialogue Manager then sends a command to the Speech Generation block, containing a text-to-speech (TTS) engine, to ask the user by synthetic voice to provide a semantic representation of the new concepts. Based upon where the user classifies the new concepts, he or she can provide a semantic representation using different input modalities. After the semantic representation is provided and acquired by the Multimodal Semantic Acquisition block, the system stores the new concepts in the corresponding concept class in the grammar and creates a new semantic object, which is stored in the semantic database. When new concepts are difficult for the ASR to recognize, due to their high confusability with other words in the recognizer's vocabulary, the correct spelling of these new concepts can be taught to the computer by typing on the keyboard. After that, the acquisition process follows the same cycle as when the new concepts were obtained from the ASR.

As mentioned earlier, the system stores the language knowledge in two databases: a grammar and a semantic database. These are permanently stored in two different files on a hard disk, from which they are loaded into the computer's volatile memory when the application starts. When the adaptive system detects unknown concepts and the user provides the corresponding semantic representation, the system dynamically updates the grammar and the semantic database in its memory with the new linguistic knowledge. At the end of each application session, the user has the option to permanently save the updated grammar and semantic database in the corresponding files on the hard disk.
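The run-time acquisition cycle just described can be summarized in a sketch such as the following; the interfaces and method names are ours, and the real system's control flow is certainly richer than this.

```java
import java.util.List;

/** Sketch of the run-time acquisition cycle for unknown concepts (all names are ours). */
public class AcquisitionLoop {

    interface Parser {
        /** Returns the words of an utterance that match no rule in the grammar. */
        List<String> findUnknownWords(String utterance);
    }
    interface SpeechGenerator { void say(String prompt); }
    interface MultimodalAcquirer {
        /** Collects a concept class and a semantic representation from the user,
            e.g. by speech, typing, pointing, drawing or image capture; may return null. */
        AcquiredConcept askUser(String unknownWord);
    }
    record AcquiredConcept(String word, String conceptClass, Object semantics) { }

    interface Grammar { void addConcept(String conceptClass, String word); }
    interface SemanticStore { void addObject(String conceptClass, String word, Object semantics); }

    void handle(String utterance, Parser parser, SpeechGenerator tts,
                MultimodalAcquirer acquirer, Grammar grammar, SemanticStore store) {
        for (String unknown : parser.findUnknownWords(utterance)) {
            tts.say("I don't know what " + unknown + " means.");
            AcquiredConcept taught = acquirer.askUser(unknown);
            if (taught == null) continue;                      // the user declined to teach it
            grammar.addConcept(taught.conceptClass(), taught.word());
            store.addObject(taught.conceptClass(), taught.word(), taught.semantics());
            tts.say("Thank you. I learned what " + taught.word() + " is.");
        }
    }
}
```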
5.2 Method
Our method of adapting the computer's dialogue at run-time is based on pre-programming a kernel of language structure and dialogue. The system then attempts only to extend the original vocabulary of concepts, and not to acquire new language structure. We call this method of language acquisition horizontal adaptation, since only new members of existing concept classes are acquired. A method of acquiring new language structure would, by contrast, represent a vertical adaptation of language knowledge. In horizontal adaptation the developer has to prepare a kernel grammar and a semantic database containing the number of concept classes and language rules necessary to ensure good coverage of the domain of the application. This is usually done by hand-crafting such a grammar and requires some special knowledge and skills.

Figure 15.4 shows a general functional diagram of the adaptive dialogue process, with emphasis on the language knowledge databases.

Figure 15.4. Storing, using and adapting the language knowledge.

Users interact with the system by employing multiple input-output modalities, as represented by link (1). The pre-programmed language knowledge is stored in the grammar and the semantic database. The language information stored in these databases is used by the system to understand the user's utterances, as shown by the upper arrows of links (2) and (3). The bottom arrows of these links show the direction of information flow for adapting the language knowledge. This approach means primarily that the computer will be able to understand new concepts (words or phrases) within existing concept classes. The method is supervised learning in which the user has an active role in teaching the computer new concepts and their corresponding semantics.
In this architecture, the grammar must contain, in addition to the specific language rules for the application, some special rules that enable the user to add new members to the existing concept classes. For a specific application, the domain can usually be covered by a relatively small number of concept classes, so the special rules necessary to adapt the grammar will not be difficult for the user to learn. In addition, the adaptation rules can be learned by users only when necessary, by consulting a manual or on-line help. Consider that the pre-programmed grammar contains R rules, such as 'Move {move} this here' or 'Select {select} [the] <colours> [colour]', and CC concept classes, such as objects, colours, etc. Here, for example, the symbol <colours> can be replaced by any member of the concept class colours. Further, the concept classes contain a total number of N_0 concepts pre-programmed by the developer. The concept classes can be of any syntactic category, such as nouns, adjectives, verbs, etc. Also, there might be multiple concept classes representing the same syntactic category, each of them representing a different semantic class of concepts, e.g., the concept class fruits (nouns) and the concept class baskets (nouns). The end-users can then adapt the system's language knowledge by expanding the grammar from the original total number of concepts N_0 to N(t) concepts at time t,

N(t) = \sum_{i=1}^{CC} N_i(t)    (15.1)
where the number of concept classes CC is constant and the number of concepts N_i(t) in any class i can be increased.

Figure 15.5 presents a logic diagram of the process of adapting the linguistic knowledge of the system by users at run-time.

Figure 15.5. Logic diagram for language adaptation.

The adaptation process starts with a user's utterance that contains unknown concepts. The system checks whether the utterance is known and legal. If yes, a communicative or non-communicative command is executed. If no, the system searches for unknown words or phrases and then asks the user to provide the surface and semantic information for these new concepts. The user can either reject this request or provide the linguistic information. If the user decides to teach the computer the new concept, he or she has to indicate the semantic class in which this concept will be included and provide a semantic representation. A number of special language rules, pre-programmed by the developer, help the user teach the computer new concepts and their meanings. Suppose that the user knows the rules pre-programmed for a specific application, but does not necessarily remember all the understood concepts stored in each concept class. For example, in a graphics application the user might think that pink is a known colour and will say 'Select the pink colour.'
In the case that pink is not pre-programmed by the developer, the system will identify the word pink as unknown and will ask whether the user wants to provide the necessary surface and semantic information for this new concept. If the user decides to provide the linguistic information, he or she first has to specify the exact concept class to which this new word belongs, such as colours in this example. Then the computer must be provided with a semantic representation, in this example by asking the computer to display a colour palette and pointing with the mouse to a region corresponding to a pink colour. The surface information can be provided with an utterance such as 'Pink is this colour.' The RGB attributes of the pixel to which the mouse was pointing are detected and used to build a new semantic object called pink from the class colours in the semantic database. For this example, the special rule necessary to acquire new concepts in the concept class colours is '<colours> is this colour.' In a similar way, the semantic representation of a new word referring to a colour can be acquired from an image captured by a CCD camera. A special case of teaching new concepts occurs when the user provides synonyms for existing concepts. In this case, the computer uses the existing semantic object to create another semantic object with a different name. However, if the user prefers, a completely new semantic object can be created for the acquired synonym. For example, if a user wants to use the word dark instead of black, he can say 'Select the dark colour'. The computer will answer 'I don't
know what dark means’, and then the user can say ‘Dark means the colour black’. The grammar can also contain homonyms, but since these concepts have the same spellings and different meanings, they cannot be part of the same concept class. Being in different concept classes, the homonyms are correctly correlated with corresponding semantic objects since they are derived from different semantic classes. For example, the word pentagon can be used to create the concept pentagon representing a polygon with five sides or angles, and can also be used to create the concept Pentagon representing the name of the pentagonal Washington headquarters of the U.S. defence forces. In the first case the semantic object is derived from a polygon class and in the second case it is derived from an organization names class.
6. Experiments
In order to demonstrate, test and evaluate the method of adaptive dialogue based on multimodal language acquisition, we implemented it on a personal computer running the Windows 2000 operating system. We created a graphic application with multiple input modalities, including speaking, pointing with a mouse, typing on a keyboard, drawing on a pen tablet with a stylus, and capturing images with a CCD camera. In this application, users can use spoken commands to display, move, delete and rotate graphical objects on the screen, or to assign or change the values of different application variables. The system is able to identify new words or phrases in the user's commands and to ask the user to provide the corresponding semantic representation. Upon receiving the semantics of the new concepts, the system stores their surface and semantic linguistic information and is able to recognize and understand these new concepts in future utterances. Users can also teach the computer system new concepts by typing. This modality is fast and accurate, and it is suitable when the user realizes that a certain word or phrase is not known by the application. The graphic application is implemented in Java and the grammar is written using the Java Speech Grammar Format specification. The automatic speech recognition and text-to-speech engines used are from IBM ViaVoice Release 8 Professional Edition. This ASR engine contains a dictation grammar with a vocabulary of 160,000 words. The initial language recognized and understood by the system is pre-programmed in the grammar and the semantic database. The initial grammar consists of 25 production rules and 23 non-terminal symbols representing concept classes such as display variables, actions, arithmetic operations, colours, 2D geometric shapes, regular polygons, graphic images, drawings, user names, etc. The initial semantic database contains semantic objects derived from these concept classes, including colours such as red, green and blue, geometric shapes such as circle and line, regular polygons such as triangle and square, and display variables such as radius, width and height.
The application has been evaluated and tested for several tasks. Three of these experiments, each performed by a different male user, are presented in the following. In a computer graphics task the user teaches the system different colour names, such as yellow, pink, mustard, burgundy, etc., and the corresponding computer representations (RGB attributes). The computer also acquires new concepts describing regular polygons, such as pentagon, hexagon or octagon, and a large number of words associated with different kinds of drawings, such as door, window, house, fence, cloud, tree, mountain, dog, cat, eye, nose, mouth, etc. Table 15.1 shows an example of a typical dialogue with the computer for the computer graphics task.

Table 15.1. Example dialogue in the computer graphics task.

User: Select the pink colour.
Computer: I don't know what pink means.
User: Show me the colour palette.
User: Pink is this colour.
Computer: Thank you. I learned what pink is.
Computer: Selecting the pink colour.
User: Create a rectangle.
Computer: Creating a rectangle.
User: Create a pentagon.
Computer: I don't know what a pentagon is.
User: A pentagon is a polygon with five sides.
Computer: Thank you. I learned what pentagon is.
Computer: Creating a pentagon.
User: Rotate the rectangle minus sixty three degrees.
Computer: Rotating the rectangle -63 degrees.
User: Now the pentagon.
Computer: Rotating the pentagon -63 degrees.
User: Move this here.
Computer: Moving the rectangle here.
User: Delete this.
Computer: Deleting the pentagon.

Elliptical inference of the object pentagon is achieved by the computer from the context for the user's utterance 'Now the pentagon'. The computer is also able to resolve deictic references such as this and here using the pointing coordinates at the times when these words occurred. Figure 15.6 shows a graphic screen for a session in which the first user taught the computer the graphic representations for the concepts: hair, face contour, left eye, right eye, nose, left ear, right ear, mouth and head, as displayed on the right hand side of the figure.
Each new word or phrase was spoken or typed, and the computer asked for semantic representations, which were provided by the user by drawing the corresponding graphics. After all of these primitive concepts were taught, the user created a composition of these graphical elements which represented a head, and then taught the computer the concept head by associating it with the graphic composition. Thus, new language knowledge was structured upon primitive or lower-level concepts. The two heads in the picture were then created by the user by saying 'Create a head' twice while pointing the cursor with the mouse to the desired location. The 9 concepts were taught in about 8 minutes by a male user familiar with the application. Most of the session time was spent in drawing.

Figure 15.6. Example screen from a computer graphics task.

In a second experiment, another user drew military icons or symbols representing concepts such as tank, helicopter, truck, etc., and associated these drawings with the corresponding words, acquired by typing on the keyboard. Figure 15.7 shows a graphic screen for a session in which the user taught the computer the graphic representations for the terms: tank, helicopter, jeep, truck, barracks, ammunition, soldier, lake and mountain. The 9 concepts were taught in about 15 minutes by the user, after familiarization with the application.
Then, using speech and pointing, the user was able to place, move, rotate and delete these graphic symbols on the screen, simulating a mission plan, and created a map as shown on the right hand side of the figure.

Figure 15.7. Example screen from a military application task.

In the third experiment, another user drew electronic components for constructing electrical diagrams and associated them with different names by typing. Figure 15.8 shows a graphic screen for a session in which the user taught the computer the graphic representations for the concepts: resistor, capacitor, operational amplifier, ground, plus, minus, contact point and connecting wire. Some of these graphic elements are displayed in the left column of the screen. The 8 concepts were taught in about 10 minutes by the third user, after familiarization with the application. Then, using speech and pointing, the user was able to place these graphic symbols on the screen and create an electrical circuit diagram, which was given the name Diagram 7.

Figure 15.8. Example screen from the drawing electronic diagrams task.

As presented earlier, users can create more complex concepts by building on existing lower-level concepts. For example, two primitive concepts, rectangle and horizontal line, can be used to build a graphic representation of a resistor. Then a resistor and a capacitor can be used to build a parallel RC circuit, and further this circuit can be used to build an audio amplifier, which can be used to build the schematic diagram of a radio receiver.
Or, as in the computer graphics application, one can build a pupil (using the concepts circle and black), then an iris (using circle and a colour concept such as brown or blue), which can be used to build an eye, which can be used to build a face, which in turn can be used to build a head, and so on. This feature can be used to create a language structure of concepts for a specific application, as can be seen in Figure 15.9.

Figure 15.9. Building a language structure with concepts.
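The layering of concepts shown in Figure 15.9 can be pictured as composition of semantic objects; the sketch below uses our own names and a deliberately simplified representation in which a composite concept merely lists the names of its parts.

```java
import java.util.List;
import java.util.Map;

/** Sketch of hierarchical concepts: a composite concept is defined by named parts. */
class CompositeConcept {
    final String name;
    final List<String> parts;              // names of lower-level concepts, e.g. "iris", "pupil"

    CompositeConcept(String name, List<String> parts) {
        this.name = name;
        this.parts = parts;
    }

    /** Expands the concept down to primitives by following part names recursively. */
    List<String> primitives(Map<String, CompositeConcept> dictionary) {
        return parts.stream()
                .flatMap(p -> dictionary.containsKey(p)
                        ? dictionary.get(p).primitives(dictionary).stream()
                        : java.util.stream.Stream.of(p))
                .toList();
    }

    public static void main(String[] args) {
        var eye = new CompositeConcept("eye", List.of("iris", "pupil"));
        var iris = new CompositeConcept("iris", List.of("circle", "brown"));
        var pupil = new CompositeConcept("pupil", List.of("circle", "black"));
        var dict = Map.of("eye", eye, "iris", iris, "pupil", pupil);
        System.out.println(eye.primitives(dict));   // [circle, brown, circle, black]
    }
}
```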
7. Conclusion
Human-computer interaction can be made more natural if spoken dialogue and multimodal interfaces are employed. We have presented in this chapter a method for making the dialogue of such spoken interfaces adaptable by users. The method requires a pre-programmed kernel of language knowledge contained in a semantic grammar and a semantic database. Users can then expand the existing concepts in each concept class by using a special set of rules pre-programmed in the grammar. When a new concept is taught, the user must indicate to the computer the concept class to which the new concept belongs and must provide a semantic representation according to some rules.
Synonyms can also be taught in the same concept class. Homonyms can only be taught as concepts in different concept classes. The adaptation of the dialogue in this method is made by acquiring new concepts in the existing concept classes and not by acquiring new rules or new concept classes into the grammar. This restriction is imposed in order to allow the computer to perform the semantic interpretation based on the pre-programmed code without intervention of the developer. The rationale for this is that if a new concept cannot be categorized in the existing semantic classes then a special code needs to be written to interpret the new concept class, and this requires re-programming and compiling the application. This method of adapting the dialogue of a human-computer interface can be implemented relatively easily. Users thus have the capability of adapting the computer dialogue’s vocabulary according to their specific application needs and they can personalize it based on their linguistic preferences. Preliminary experiments targeted at different computer graphics applications showed a potential benefit of this method. It is easier and faster, for example, to create multiple instances of the same symbol or drawing on the computer screen by combining speech and pointing, than by re-drawing the same symbol each time. Usability studies are required for quantitative evaluation of the benefits of language acquisition and vocabulary adaptation. These evaluations are necessary as new computer applications of this research are identified.
A future direction of this research area would be the vertical adaptation of the human-computer dialogue, by acquiring from users new semantic classes of concepts and new language rules.
Acknowledgements

This research was supported by the National Science Foundation under the Knowledge and Distributed Intelligence project, grant NSF IIS-98-72995.
References

Bolt, R. A. (1980). Put-That-There: Voice and Gesture at the Graphics Interface. Computer Graphics, 14(3):262-270.

Chomsky, N. (1986). Knowledge of Language: Its Nature, Origin, and Use. New York: Praeger.

Cohen, P. R., Johnston, M., McGee, D., Oviatt, S. L., Clow, J., and Smith, I. (1998). The Efficiency of Multimodal Interaction: A Case Study. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), volume 2, pages 249-252, Sydney, Australia.

de Mori, R. (1999). Recognizing and Using Knowledge Structures in Dialog Systems. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 297-307, Keystone, Colorado, USA.

de Villiers, J. G. and de Villiers, P. A. (1978). Language Acquisition. Cambridge, Massachusetts, and London, England: Harvard University Press.

Dusan, S. and Flanagan, J. (2001). Human Language Acquisition by Computers. In Proceedings of the International Conference on Robotics, Distance Learning and Intelligent Communication Systems, pages 387-392, Malta. WSES/IEEE.

Dusan, S. and Flanagan, J. (2002a). Adaptive Dialog Based upon Multimodal Language Acquisition. In Proceedings of the Fourth International Conference on Multimodal Interfaces (ICMI), pages 135-140, Pittsburgh, PA, USA.

Dusan, S. and Flanagan, J. (2002b). An Adaptive Dialogue System Using Multimodal Language Acquisition. In Proceedings of the International CLASS Workshop: Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, pages 72-75, Copenhagen, Denmark.

Gavalda, M. and Waibel, A. (1998). Growing Semantic Grammars. In Proceedings of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), pages 451-456, Montreal, Quebec, Canada.

Gorin, A. L., Levinson, S. E., Gertner, A., and Goldman, E. (1991). On Adaptive Acquisition of Language. Computer Speech and Language, 5(2):101-132.
Gorin, A. L., Levinson, S. E., and Sankar, A. (1994). An Experiment in Spoken Language Acquisition. IEEE Transactions on Speech and Audio, 2(1):224-240. Part II.

Gorin, A. L., Riccardi, G., and Wright, J. H. (1997). How May I Help You? Speech Communication, 23:113-127.

Henis, E. A., Levinson, S. E., and Gorin, A. L. (1994). Mapping Natural Language and Sensory Information into Manipulatory Actions. In Proceedings of the Yale Workshop on Adaptive and Learning Systems, pages 324-356, Yale University, New Haven.

Miller, L. G. and Gorin, A. L. (1993). Structured Networks for Adaptive Language Acquisition. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):873-898.

Oates, T. (2001). Grounding Knowledge in Sensors: Unsupervised Learning for Language and Planning. PhD thesis, MIT.

Roy, D. K. (1999). Learning Words from Sights and Sounds: A Computational Model. PhD thesis, MIT, Program in Media Arts and Sciences, School of Architecture and Planning.

Sadek, D. and de Mori, R. (1998). Dialogue Systems. In de Mori, R., editor, Spoken Dialogues with Computers, pages 563-582. Academic Press.

Sankar, A. and Gorin, A. L. (1993). Adaptive Language Acquisition in a Multisensory Device. In Mammone, R., editor, Artificial Neural Networks for Speech and Vision, pages 324-356. London: Chapman and Hall.

Young, S. (2002). Talking to Machines (Statistically Speaking). In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pages 9-16, Denver, Colorado, USA.
Chapter 16

MACHINE LEARNING APPROACHES TO HUMAN DIALOGUE MODELLING

Yorick Wilks, Nick Webb, Andrea Setzer, Mark Hepple and Roberta Catizone
Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK
{y.wilks, n.webb, a.setzer, m.hepple, r.catizone}@dcs.shef.ac.uk

Abstract
We describe two major dialogue system segments. The first is an analysis module that learns to assign dialogue acts from corpora, but on the basis of limited quantities of data, and up to what seems to be some kind of limit on this task, a fact we also discuss. Secondly, we describe a Dialogue Manager which uses a representation of stereotypical dialogue patterns that we call Dialogue Action Frames, which are processed using simple and well-understood algorithms adapted from their original role in syntactic analysis, and which, we believe, generate strong and novel constraints on later access to incomplete dialogue topics.
Keywords:
Dialogue management, machine learning, dialogue acts, dialogue modelling, Dialogue Action Frames.
1. Introduction
Computational modelling of human dialogue is an area of NLP in which there are still a number of open research issues about how such modelling should best be done. Most research systems so far have relied on largely hand-coded, inflexible representations of dialogue states, implemented as some form of finite-state or other rule-based machine (e.g. the TRINDI systems at the most theoretical end [Larsson and Traum, 2000]). These approaches have addressed robustness issues within spoken language dialogue systems by limiting the range of options and vocabulary available to the user at any given stage in the dialogue. They have, by common agreement, failed to capture much of the flexibility and functionality inherent in human-human communication, and the resulting systems have far less than optimal conversational capability and are neither pleasant nor natural to use.
Meanwhile, many low-functionality systems have nevertheless been deployed in the market in domains such as train reservations. On the other hand, more flexible, conversationally plausible models of dialogue, such as those based on planning [Allen et al., 1995], are knowledge-rich and require very large amounts of manual annotation to create. They model individual communication actions, which are dynamically linked together into plans to achieve communicative goals, in a tradition of work on trains going back to Perrault and his students [Allen and Perrault, 1980]. This method has greater scope for reacting to user input and correcting problems as they occur.

Both types of contemporary theoretical model above have long histories, but have never placed their emphasis on implementation or evaluation. The only regular forum for the evaluation of dialogue systems has been the Loebner Competition [1990], whose value has not been universally accepted, but which has served the field by keeping the emphasis on evaluation and innovation. Some of the current authors were among the designers of the winning Loebner entry in 1997, and some principles from that system, CONVERSE [Levy et al., 1997], have survived in what we propose here, namely: machine learning, knowledge of conversational strategy encoded in flexible script structures rather than full plans, and a mechanism for judging the relative balance of system and user initiatives.

The model we wish to present occupies a position between these two approaches: full planning systems and turn-based dialogue move engines. We contend that larger structures are necessary to represent the content and context provided by mini-domains or meta-dialogue processes (a term we shall explain), as opposed to modelling only turn taking. The traditional problems with our position are: how to obtain the data that such structures (which we shall call Dialogue Action Frames or DAFs) contain, and how to switch rapidly between them in practice, so as not to be stuck in a dialogue frame inappropriate to what a user has just said. We shall explain their functioning within an overall control structure that stacks DAFs, and show that we can leave a DAF in any dialogue state and return to it later if appropriate, so that there is no loss of flexibility and we can retain the benefits of larger-scale dialogue structure. For now, DAFs are hand-coded, but we shall show later in the chapter how we are seeking to learn them from annotated dialogue corpora. In so doing, we hope to acquire those elements of human-human communication which may make a system more conversationally plausible.

The second major area that remains unsettled in dialogue modelling is the degree to which its modules can be based directly on abstractions from data (abstractions usually obtained by some form of Machine Learning), as significant parts of NLP have been over the last fifteen years.
We shall describe a system for learning the assignment of dialogue acts (DAs) and semantic content directly from corpora, while noting the following difficulty: in the five years since Samuel et al. [1998] first demonstrated such a technique based on Transformation-Based Learning (TBL, [Brill, 1995]), the figures obtained [Stolcke et al., 2000] have remained obstinately in the area of 65%+ and have not risen towards the nineties, as has been the case in other, perhaps less complex, areas of linguistic information processing such as part-of-speech tagging. In the model that follows, we hypothesise that the information content of DAs may be such that some natural limit has appeared to their resolution by the kinds of n-gram-based corpus analysis used so far, and that the current impasse, if it is one, can only be solved by realising that DA training is inherently low quality and that higher-level dialogue structures in the DM will be needed to refine the input DAs, that is, by using the inferential information in DAFs, along with access to the domain model. This hypothesis, if true, explains the lack of progress with purely data-driven research in this area and offers a concrete hybrid model, to employ an overused word in the area of NLP and ML. This process could be seen as one of correction or reassignment of DA tags to input utterances in a DM, where a higher-level structure will be able to choose from some (possibly ordered) list of alternative DA assignments as selected by our initial process.
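The kind of hybrid correction step suggested here can be illustrated abstractly: given an ordered list of candidate DA tags from the learned classifier and the set of DAs that the currently active DAF expects, the DM would prefer the highest-ranked candidate the DAF can accept. The sketch below is our own illustration of that idea, not a description of the implemented system, and the tag names are invented.

```java
import java.util.List;
import java.util.Set;

/** Sketch: a DAF re-ranks the classifier's n-best dialogue-act hypotheses. */
class DialogueActReranker {

    record Candidate(String dialogueAct, double score) { }

    /** Picks the best-scoring candidate the current DAF expects, else the overall best. */
    static String choose(List<Candidate> nBest, Set<String> expectedByDaf) {
        return nBest.stream()
                .filter(c -> expectedByDaf.contains(c.dialogueAct()))
                .findFirst()                       // nBest is assumed ordered by score
                .orElse(nBest.get(0))
                .dialogueAct();
    }

    public static void main(String[] args) {
        var nBest = List.of(new Candidate("statement", 0.41),
                            new Candidate("yes-answer", 0.38),
                            new Candidate("backchannel", 0.21));
        // A DAF that has just asked a yes/no question expects an answer.
        System.out.println(choose(nBest, Set.of("yes-answer", "no-answer")));  // yes-answer
    }
}
```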
2. Modality Independent Dialogue Management
Any survey of this field might suggest that we may now be in something of the same position as the field of Information Extraction (IE) when Jerry Hobbs [1993] wrote his brief note on a generic IE system, on the assumption that all functioning IE systems contained roughly the same modules doing the same tasks, and that the claimed differences were largely matters of advertising, plus dubious claims to special theories with trademarked names. However, and as we noted earlier, we may be in a worse position with dialogue systems because, unlike IE, there is substantial disagreement on the core control structure of a dialogue system, and little or no benchmarked performance with which to decide which modules and theoretical features aid robust dialogue performance. The lack of an established evaluation methodology is, by common consent, one of the main features holding back robust dialogue development methodology. We cannot even appeal to some accepted progression of systems towards an agreed level of maturity, as one can in some areas of NLP like IE: even very primitive dialogue systems from long ago contain features which many would associate with very sophisticated systems: Carbonell’s POLITICS [Carbonell et al., 1983] seems to be just a series of questions and answers to a complex knowledge base, but it is clear that he considers the system to deploy coded
forms of goals, beliefs and plans, which one might take as a sufficient property for being in a more developed class of systems. Even PARRY [Colby, 1971], the most developed and robust of the early dialogue systems, though based on no more than fast pattern matching, very clearly had the goal of informing the user of certain things and, even though it had no explicit representation of goals and beliefs, it did have a primitive but explicit model of the user.
2.1 Initial Design Considerations
The development of our Dialogue Management strategies has occurred largely within the COMIC (Conversational Multimodal Interaction with Computers) project (see http://www.hcrc.ed.ac.uk/comic/), whose object is to build a cooperative multi-modal dialogue system which aids the user in the complex task of designing a bathroom, and which is to be deployed in a showroom scenario. A central part of this system is the Dialogue and Action Manager (DAM). We assumed that a plausible DAM system must have at least the following functionalities:

(a) determine the form of response appropriate to dialogue turn pairs, where appropriate means in both pragmatic terms (i.e. dialogue act function) and semantic terms (i.e. giving correct answers to questions, if known);

(b) have some form of representation of a whole dialogue, which means not only opening and closing it appropriately, but knowing when a topic has been exhausted, and also how and when to return to it, if necessary, even though exhausted from the system's point of view;

(c) have appropriate access to a database if there is to be question-answering on the basis of stored (usually application-dependent) knowledge;

(d) have appropriate access to a database that can be populated if information is to be elicited from the user as part of a basic task;

(e) have some form of reasoning, belief/goal/intention representation, user modelling and planning sufficient to perform these tasks, though this need not imply any particular form of representation or mechanism for implementing these functionalities;

(f) have some general and accessible notion of where in the whole dialogue and task performance the system is at any moment.

The key problems for dialogue system performance, and therefore reasons for failure, are:

(i) the inability of a dialogue system to find the relevant structure/frame that encapsulates what is known to the system about the subject under discussion, and to use this to switch topics when the user dictates that.
This is the main form of what we shall call the frame detection problem in dialogue management, one normally addressed by some level of overlap of terms in the input with indexes attached to particular task frames in the current application (a toy illustration of such overlap scoring is sketched at the end of this subsection).

(ii) recovery from not knowing how to continue in a given dialogue state, a problem for all dialogue systems, for which quite different strategies are found in the field: e.g. the Rochester-style strategy [Allen and Perrault, 1980] of the system taking a definite, and possibly wrong, line with the user, relying on robust measures for revision and recovery if wrong, as opposed to a hesitant (and potentially irritating) system that seeks constant confirmation from the user before deciding on any action. We shall opt for the former strategy, and hope for sufficiently robust recovery, while building in implicit confirmations for the user wherever appropriate.

We anticipate a core dialogue engine that is both a simple and perspicuous virtual machine (and not a collection of data, links and functionalities under no clear control) and which can capture (given good data structures) the right compromise between push (user initiative) and pull (system initiative) that any robust system must have. Our DAM sketch below, now implemented and integrated into the COMIC project, is intended to capture this combination of perspicuity (for understanding the system and allowing data structures to be written for it) and compromise between the two opposed dialogue initiative directions.
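The overlap-based frame detection mentioned under (i) can be made concrete with a toy scorer; the frame names and index terms below are invented for the example and do not come from COMIC.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Set;

/** Toy frame detector: picks the task frame whose index terms best overlap the input. */
class FrameDetector {

    static String detect(Set<String> inputWords, Map<String, Set<String>> frameIndexes) {
        return frameIndexes.entrySet().stream()
                .max(Comparator.comparingLong((Map.Entry<String, Set<String>> e) ->
                        e.getValue().stream().filter(inputWords::contains).count()))
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        var frames = Map.of(
                "tile-colour-choice", Set.of("tile", "tiles", "colour", "pink", "blue"),
                "shower-selection", Set.of("shower", "cubicle", "screen"));
        var input = Set.of("i", "would", "prefer", "blue", "tiles");
        System.out.println(detect(input, frames));   // tile-colour-choice
    }
}
```

A real detector would weight index terms and require a minimum score before switching frames, but simple term overlap conveys the mechanism.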
2.2 Choosing a Level of Structure
There is as yet no consensus as to whether a DAM should be expressed simply as a finite-state automaton, a well-understood and easy-to-implement representation, or should utilise more complex, knowledge-based approaches such as the planning mechanism employed by systems like TRAINS [Allen et al., 1995]. The argument between these two views is, at bottom, about how much stereotypy one expects in a dialogue, which is to say, about how much it is worth collecting all the rules relevant to a subtopic together within some larger structure or partition. Stereotypy in dialogue is closely connected to the notion of system initiative or top-down control, which is strongest in "form-filling" systems and weakest in chatbots. If there is little stereotypy in dialogue turn ordering, then any larger frame-like structure risks being over-repetitious, since all possibilities must be present at many nodes. If a system must always be ready to change topic in any state, it can be argued, then what is the purpose of being in a higher-level structure that one may have to leave? The answer is that it is possible to be always ready to change topic, but to continue on if a change is not forced: as with all frame-like structures since the beginning of AI, they express no more than defaults or preferences.
The same opposition was present in early AI planning theory between rule-driven planners and systems like SRI's STRIPS, which pioneered more structural objects consisting of expected default actions [Fikes and Nilsson, 1971]. The WITAS system [Lemon et al., 2001] was, initially at least, based on networks of ATN (Augmented Transition Network) structures, stacked on one of two stacks. In the DAM described below we also opt for an ATN-like system, whose application mechanism is a single stack (with one slight modification) of DAFs (Dialogue Action Frames), and we suggest that the WITAS argument for abandoning an ATN-type approach (namely, that structure is lost when a net is popped) is easily overcome. We envisage DAFs of radically different sizes and types: complex ones for large-scale information-eliciting tasks, and small ones for dialogue control functions such as seeking to reinstate a topic. Our argument will be that the simplicity and perspicuity of this (well understood and easily written and programmed) virtual machine, at least in its standard form, together with the ability it gives to leave and return to a topic in a natural and straightforward way, has benefits that outweigh its disadvantages. As we shall see below, this is a complex issue: the need to return to unpopped syntactic ATN networks, so as to ensure completeness of parsing, is quite different in motivation from that of returning to an interrupted topic in dialogue processing. In syntactic parsing one must so return, but in dialogue one can sometimes return in a way that is pragmatically inappropriate; we deal with that below, and seek new forms of dialogue constraint and generalization.
2.3 DAFs: A Modest DAM Proposal
We propose a single push-pop stack architecture that loads structures of radically differing complexities but whose overall form is the DAF. The algorithm to operate such a stack is reasonably well understood, though we will suggest below one amendment to the classical algorithm, so as to deal with a dialogue revision problem that cannot be handled by structure nesting. The general argument for such a structure is its combination of power, simplicity and perspicuity. Its key language-relevant feature (known since the time of Woods [1970] in syntactic parsing) is the fact that structures can be pushed down to any level and re-entered via suspended execution, which allows nesting of topics as well as features like barge-in and revision, with a smooth and clear return to unfinished material and topics. This is so well known that it has entered the everyday language of computer folk as "stack that topic for a moment". Although, in recursive syntax, incomplete parsing structures must be returned to and completed, in dialogue one could argue that not all incomplete structures should be re-entered for completion, since it is unnatural
to return to every suspended topic, no matter how long it has been suspended, unless, that is, the suspended structure contains information that must be elicited from the user. One experimental question here will be whether there are constraints on such re-entry to suspended networks, analogous to the semantic networks in Grosz's [1977] dialogue systems and the absolute constraints she proposed on long-range reference back to open topics. There will be DAFs corresponding to each of the system-driven sub-tasks (e.g. filling in the forms the bathroom salesman aims to have completed at the end of a client session), which elicit information and whose commands write directly to the output database. There will also be DAFs for standard greetings and farewells, and for complex dialogue control tasks such as revisions and responses to conversational breakdowns. At a finer granularity, DAFs will express simple dialogue act pairs (such as question-answer), which can be pushed at any time (on user initiative) and will be exhausted (and popped) after an SQL query to the COMIC database. The stack is preloaded with a (default) ordered set of system-initiative DAFs, with Greeting at the top and Farewell at the bottom, such that the dialogue ends with maximum success when these and all the intermediate information-eliciting DAFs for the task have been popped. This would be the simplest case: a maximally cooperative user who takes no initiative whatever; such a user may be rare, but must be catered for if he exists. An obvious problem arises here, noted in earlier discussion, which may require us to adapt this overall DAM control structure: if the user proposes an information-eliciting task before the system does (e.g., in the bathroom world, the client wants to discuss tile-colour-choice before that DAF is reached in the stack), then that structure must immediately be pushed onto the stack and executed until popped; but obviously its copy lower in the stack must not be executed again when it later reaches the top. The integrity of the stack algorithm needs to be violated only to the extent that any task-driven structure at the top of the stack is executed from its initial state only if the relevant part of the database is incomplete. However, a closely related issue (and one that caused the WITAS researchers to change their DAM structure, wrongly in our view) is the situation where a user initiative forces the revision or reopening of a major topic already popped from the stack; e.g., in the bathroom world, the user has chosen pink tiles but later, on her own initiative, decides she would prefer blue and initiates the topic again. This causes our proposal no problems: the tile-colour-choice DAF is pushed again (empty and uninstantiated) but with an entry subnetwork (no problem for DAFs) that can check the database, see that it is complete, and begin the subdialogue in a way that shows the system knows a revision is being requested. It seems clear to us that a simple stack architecture is proof against arguments based on the need to revisit popped structures,
provided the system can distinguish this case (user initiative) from the previous one (a complete structure revisited by system initiative). A similar device will be needed when a partly executed DAF on the stack is re-entered after an interval, a situation formally analogous to a very long syntactic dependency or long-range co-reference. In such cases, the user should be asked whether he wishes to continue the suspended network to completion. It will be an experimental question later, when data has been generated, whether there are constraints on access to incomplete DAFs that would allow them to be dumped from the top of the stack unexecuted, provided they contain no unfilled requests for bathroom choices. What has not been touched upon here is the provision, outside the main stack and content structures, of DAM modules that express models of the user's goals, beliefs and intentions and reason over them. We postpone this discussion as inessential for getting a DAM started and able to generate dialogue data for later learning and modification (see further below), provided that what we ultimately propose can transition from simpler to more complex structures and functions without radical redesign. To deploy such a capacity for bathroom advice would require a rather implausible scenario in which the advisor has to deal with, e.g., a client couple, possibly interviewed separately, so that the system has to construct each partner's view of the other's wishes. We expect later to build into the DAM an explicit representation of plan tasks, and this will present no problem for a DAF, since recursive networks can be, and often have been, a standard representation of plans; this makes it odd that some redesigners of DAMs have argued against using ATNs as DAM models, wrongly identifying them with low-level dialogue grammars rather than recognising them as structures (ATNs) more general than those used for standard plans (RTNs). As a model of the goals, intentions and beliefs of the dialogue participants, we expect to use our procedural ViewGen model [Ballim and Wilks, 1991].
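The following is a minimal sketch of the single-stack DAF mechanism just described, including the amendment that a task DAF reaching the top of the stack is executed from its initial state only if its part of the database is still incomplete; the class names, slot representation and database check are assumptions made for exposition, not the COMIC code.

    # Illustrative sketch only: a single-stack DAM over Dialogue Action Frames.
    class DAF:
        """A Dialogue Action Frame, responsible for eliciting a set of database slots."""
        def __init__(self, name, slots=()):
            self.name = name
            self.slots = tuple(slots)   # empty for control DAFs (Greeting, Farewell)

        def is_complete(self, database):
            return all(database.get(slot) is not None for slot in self.slots)

    class DAFStack:
        def __init__(self, preloaded, database):
            self.stack = list(preloaded)   # Farewell at the bottom, Greeting on top
            self.database = database

        def push(self, daf):
            """User initiative: push a DAF (e.g. tile-colour-choice) out of default order."""
            self.stack.append(daf)

        def next_daf(self):
            """Pop until a DAF that should still run is found. Task DAFs whose slots
            were already filled earlier are skipped; control DAFs always run."""
            while self.stack:
                daf = self.stack.pop()
                if not daf.slots or not daf.is_complete(self.database):
                    return daf
            return None   # maximal success: every information-eliciting DAF popped

    # Default system-initiative ordering for the bathroom task (illustrative).
    db = {"tile_colour": None, "shower_type": None}
    dam = DAFStack([DAF("Farewell"),
                    DAF("shower-selection", ["shower_type"]),
                    DAF("tile-colour-choice", ["tile_colour"]),
                    DAF("Greeting")], db)

A revision such as the pink-to-blue tile example above would simply push a fresh tile-colour-choice DAF whose entry subnetwork notices that the relevant database slot is already filled.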
3. Learning to Annotate Utterances
In the second part of this chapter, we focus on experiments in modelling aspects of dialogue directly from data. In the jointly EU- and US-funded AMITIES project we are building automated service counters for telephone-based interaction, using large amounts of recorded human-human data. We first report on experiments in learning the analysis part of the dialogue engine, that is, the part which converts utterances to dialogue acts and semantic units. We then outline the next stage of research: learning dialogue structure from a corpus.
3.1 Machine Learning for Dialogue Act Tagging
There has been increasing interest in using machine learning techniques to solve problems in spoken dialogue. One thread of this work has addressed dialogue act modelling, i.e. the task of assigning an appropriate dialogue act tag to each utterance in a dialogue. It is only recently, with the availability of annotated dialogue corpora, that empirical research in this area has become possible. Two key annotated corpora, which have formed the basis for work on dialogue act modelling, are of particular relevance here: first, the VerbMobil corpus [Reithinger and Klesen, 1997], created within the project developing the VerbMobil speech-to-speech translation system, and secondly, the Switchboard corpus [Jurafsky et al., 1998]. Of the two, Switchboard has generally been considered to present the more difficult problem for accurate dialogue act modelling, partly because it has been annotated with a total of 42 distinct dialogue acts, in contrast to the 18 used in the VerbMobil corpus, and a larger tag set makes consistent judgements harder. In addition, Switchboard consists of unstructured, non-directed conversations, in contrast to the highly goal-directed dialogues of the VerbMobil corpus. One approach that has been tried for dialogue act tagging is n-gram language modelling, exploiting ideas drawn directly from speech recognition. For example, Reithinger and Klesen [1997] applied such an approach to the VerbMobil corpus, which provides only a rather limited amount of training data, and report a tagging accuracy of 74.7%. Stolcke et al. [2000] apply a somewhat more complicated n-gram method to the Switchboard corpus, employing both n-gram language models of individual utterances and n-gram models over dialogue act sequences, and achieve a tagging accuracy of 71% on word transcripts, drawing on the full 205k utterances of the data; of this, 198k utterances were used for training, with a 4k utterance test set. These performance differences can be seen to reflect the differential difficulty of tagging the two corpora. A second approach, by Samuel et al. [1998], uses transformation-based learning over a number of utterance features, including utterance length, speaker turn and the dialogue act tags of adjacent utterances. They achieved an average score of 75.12% tagging accuracy over the VerbMobil corpus. A significant aspect of this work, of particular relevance here, addressed the automatic identification of word sequences that would form dialogue act cues. A number of statistical criteria are applied to identify potentially useful n-grams, which are then supplied to the transformation-based learning method to be treated as 'features'.
3.2 Creating a Naive Classifier
As just noted, Samuel et al. [1998] investigated methods for identifying word n-grams that might serve as useful dialogue act cues for use as features in transformation-based learning. We decided to investigate how well n-grams could perform when used directly for dialogue act classification, i.e. with an utterance being classified solely from the individual cue phrases it contains. Two questions immediately arise: firstly, which n-grams should be accepted as cue phrases for which dialogue acts, and secondly, which dialogue act tag should be assigned when an utterance contains several cue phrases that are indicative of different dialogue act classes. In the current work, we have answered both of these questions principally in terms of predictivity, i.e. the extent to which the presence of a certain n-gram in an utterance predicts the utterance having a certain dialogue act category, which for an n-gram n and dialogue act category d corresponds to the conditional probability P(d | n). A set of n-gram cue phrases was derived from the training data by collecting all n-grams of length 1–4 and counting their occurrences in the utterances of each dialogue act category and in total. These counts allow us to compute the above conditional probability for each n-gram and dialogue act. This set of n-grams is then reduced by applying thresholds of predictivity and occurrence, i.e. eliminating any n-gram whose maximal predictivity for any dialogue act falls below some minimum requirement, or whose maximal number of occurrences with any category falls below a threshold value. The n-grams that remain are used as cue phrases. The threshold values used in our experiments were arrived at empirically. To classify an utterance, we identify all the cue phrases it contains and determine which has the highest predictivity of some dialogue act category; that category is then assigned. If multiple cue phrases share the same maximal predictivity but predict different categories, one category is assigned arbitrarily. If no cue phrases are present, a default tag is assigned, corresponding to the most frequent tag in the training corpus.
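A minimal sketch of this naive classifier is given below, assuming a corpus of (tokenised utterance, dialogue act) pairs; the data format and helper names are illustrative, not our actual code.

    # Illustrative sketch only: cue-phrase extraction and naive classification.
    from collections import Counter, defaultdict

    def ngrams(tokens, max_n=4):
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                yield tuple(tokens[i:i + n])

    def train(corpus, min_predictivity=0.25, min_occurrence=8):
        """Return (cue phrases mapped to (best act, predictivity), default tag)."""
        joint, total, act_freq = defaultdict(Counter), Counter(), Counter()
        for tokens, act in corpus:
            act_freq[act] += 1
            for g in set(ngrams(tokens)):
                joint[g][act] += 1
                total[g] += 1
        cues = {}
        for g, acts in joint.items():
            best_act, best_count = acts.most_common(1)[0]
            predictivity = best_count / total[g]          # P(d | n)
            if predictivity >= min_predictivity and best_count >= min_occurrence:
                cues[g] = (best_act, predictivity)
        return cues, act_freq.most_common(1)[0][0]

    def classify(tokens, cues, default_tag):
        """Assign the act of the most predictive cue phrase present, else the default."""
        found = [cues[g] for g in set(ngrams(tokens)) if g in cues]
        return max(found, key=lambda c: c[1])[0] if found else default_tag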
3.3 Corpus and Data Sets
For our experiments, we used the Switchboard corpus, which consists of 1,155 annotated conversations comprising around 205k utterances. The dialogue act types for this set are described in [Jurafsky et al., 1997]. From this source we derived two alternative data sets. Firstly, we extracted 50k utterances and divided them into 10 subsets as a basis for 10-fold cross-validation (i.e. giving 45k/5k utterance set sizes for training/testing). This volume was selected as being large enough to give an idea of how well the methods could perform when a good volume of data is available, but not so large as to make 10-fold cross-validation experiments prohibitively slow to train.
The second data set was selected for loose comparability with the work of Samuel, Carberry and Vijay-Shanker on the VerbMobil corpus, who used training and test sets of around 3k and 300 utterances. Accordingly, we extracted 3,300 utterances from Switchboard and divided them for 10-fold cross-validation.
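For concreteness, the 10-fold division can be sketched as follows; this is a simple contiguous split, and whether the actual experiments shuffled or stratified the utterances is not stated here, so that detail is only an assumption.

    # Illustrative sketch only: a 10-fold split over a list of utterances.
    def ten_fold(items):
        """Yield (train, test) pairs, each fold holding out one tenth of the data."""
        fold = len(items) // 10
        for k in range(10):
            test = items[k * fold:(k + 1) * fold]
            train = items[:k * fold] + items[(k + 1) * fold:]
            yield train, test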
3.4 Experiments
We evaluated the naive tagging approach using the two data sets just described, in both cases using a predictivity threshold of 0.25 and an occurrence threshold of 8 to determine the set of cue phrases. Applied to the smaller data set, the approach yields a tagging accuracy of 51.8%, against a baseline accuracy of 36.5% from applying the most frequently occurring tag in the Switchboard data set (sd, statement). Applied to the larger data set, the approach yields a tagging accuracy of 54.5%, compared to 33.4% from using the most frequent tag. More recent experiments suggest that we can improve this score dramatically. We introduced start and end tags on every utterance (to capture phrases which serve as cues specifically in these positions), and trained separate models for specific utterance lengths: for example, one model for utterances of length 1, another for lengths of 2 to 4 words, and another for lengths of 5 and above. Combining these features, we obtained a maximal score for our naive tagger of 63% over the larger data set. Given that Stolcke et al. achieve a total tagging accuracy of around 70% on Switchboard data, our approach goes a long way towards reproducing the benefits of that approach, but using only a fraction of the data and a much simpler model (i.e. individual dialogue act cues rather than a complete n-gram language model).
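The two refinements just mentioned can be sketched as follows, reusing the hypothetical train and classify functions from the earlier sketch; the boundary tokens and length bands follow the description above, while everything else is an assumption.

    # Illustrative sketch only: boundary tokens and length-banded models.
    def add_boundaries(tokens):
        return ["<s>"] + list(tokens) + ["</s>"]

    def length_band(tokens):
        n = len(tokens)
        return 0 if n == 1 else (1 if n <= 4 else 2)   # 1 word / 2-4 words / 5+

    def train_banded(corpus):
        banded = {0: [], 1: [], 2: []}
        for tokens, act in corpus:
            banded[length_band(tokens)].append((add_boundaries(tokens), act))
        return {band: train(data) for band, data in banded.items() if data}

    def classify_banded(tokens, models):
        cues, default_tag = models[length_band(tokens)]
        return classify(add_boundaries(tokens), cues, default_tag)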
3.5 Experiments with Transformation-Based Learning
Transformation-based learning (TBL) was first applied to dialogue act modelling by Samuel, Carberry and Vijay-Shanker, who achieved overall scores of 75.12% tagging accuracy on the VerbMobil corpus. As previously noted, one aspect of their work addressed the identification of potential cue phrases for use as features during transformation-based learning, i.e. so that transformation rules can be learned which require the presence of a given cue as a context condition for the rule to fire. In that work, the initial tagging state of the training data, from which TBL learning begins, was produced by assigning every utterance a default tag corresponding to the most frequent tag over the entire corpus. In our experiments, we wanted to investigate two issues. Firstly, whether a more effective dialogue act tagger could be produced by using our naive n-gram classifier as a pre-tagger generating the initial tagging state over
which a TBL tagger could then be trained. It seems plausible that the increased accuracy of the initial tag state produced by the naive classifier, as compared to assigning just the most frequent tag, might provide a basis for more effective subsequent training. Secondly, we wanted to assess the impact of using larger volumes of training data with a transformation-based approach, given that Samuel et al.'s results are based on a quite small data set from the VerbMobil corpus. As an implementation of transformation-based learning, we used the freely available µ-TBL system of Lager [1999]. The current distribution of µ-TBL provides an example system for dialogue act modelling, including a simple set of templates developed with reference to the work of Samuel et al. and applied to the MapTask domain [Lager and Zinovjeva, 1999]. We have used this set of templates for all our experiments. We should note that the Lager and Samuel et al. template sets differ considerably: Samuel et al. use thousands of templates (together with a Monte Carlo variant of the training algorithm), whilst the µ-TBL templates are much fewer in number, may refer only to the dialogue act tags of preceding utterances (i.e. not both left and right context), and may refer to any unigram or bigram appearing within utterances as part of a context condition, i.e. they are not provided with a fixed set of dialogue act cues to consider. Our best results over the larger data set from the Switchboard corpus are around 66%, applying TBL to an initial data state produced by the naive classifier. Interestingly, our results indicate that the naive classifier achieves most of the gain, with TBL consistently adding only 2 or 3%. In further work, we intend to apply other machine learning algorithms to the results of pre-tagging the data with the naive classifier.
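The pre-tagging arrangement can be pictured as below. This is a conceptual illustration of using the naive classifier's output as the initial state for transformation-based learning, not the interface of the µ-TBL system; the simplified rule template (previous tag plus an in-utterance cue) and the classify function from the earlier sketch are assumptions.

    # Illustrative sketch only: naive pre-tagging followed by TBL-style rules.
    class Rule:
        """If the previous tag is `prev` and the utterance contains `cue`,
        change the current tag from `old` to `new`."""
        def __init__(self, old, new, prev, cue):
            self.old, self.new, self.prev, self.cue = old, new, prev, cue

        def apply(self, tags, utterances):
            for i in range(1, len(tags)):
                if (tags[i] == self.old and tags[i - 1] == self.prev
                        and self.cue in utterances[i]):
                    tags[i] = self.new
            return tags

    def tbl_tag(utterances, cues, default_tag, rules):
        """Pre-tag with the naive classifier, then apply learned rules in order."""
        tags = [classify(u, cues, default_tag) for u in utterances]   # initial state
        for rule in rules:
            tags = rule.apply(tags, utterances)
        return tags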
4. Future Work: Data-Driven Dialogue Discovery
Using the same corpus as above, to what extent could we discover the structure of the DAFs themselves, and their boundaries, from segmentations of annotated corpora? We are currently exploring the possibility of deriving the DAF structures by taking a dialogue-act-annotated corpus and annotating it further with an information extraction engine [Larsson and Traum, 2000] that seeks and semantically tags the major entities and their verbal relations in the corpus, which is to say, derives a surface-level semantic structure for it. One function of this additional semantic tagging will be to add features for the DA tagging methods already described, in the hope of improving our earlier figures by adding semantic features to the learning process. The other, and more original, possibility is that of seeking repeated sequences of DA-plus-semantic-triple types (verb, plus agent and object types) and endeavouring to optimise the "packing" of such sequences to fill as much as
possible of a dialogue, using some algorithm such as Minimum Description Length, so as to produce reusable, stereotypical dialogue segments. We anticipate combining this with corpus segmentation by topic alone, following Hearst's text tiling technique [Hearst, 1993]. Given any success at learning the segmentation of dialogue data, we expect to use some form of the reinforcement learning approach of Walker [2000] to optimise the DAFs themselves.
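As a very rough sketch of the kind of repeated-sequence search intended here, the following counts recurring subsequences of (dialogue act, semantic triple) items; simple frequency thresholds stand in for the Minimum Description Length packing mentioned above, and the data format is an assumption.

    # Illustrative sketch only: candidate stereotypical segments as recurring
    # subsequences of (dialogue_act, (verb, agent_type, object_type)) items.
    from collections import Counter

    def repeated_sequences(dialogues, min_len=2, max_len=5, min_count=3):
        counts = Counter()
        for dialogue in dialogues:
            items = tuple(dialogue)
            for n in range(min_len, max_len + 1):
                for i in range(len(items) - n + 1):
                    counts[items[i:i + n]] += 1
        return {seq: c for seq, c in counts.items() if c >= min_count}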
5. Discussion
The work described in the last section is at a very preliminary stage, but will, we hope, form part of the general strategy of this chapter, which is to derive a set of weak, general learning methods that will be strong in combination (in the spirit of Newell [1990]). This means combining a top-down DAM, using DAFs learned from corpus data, with a DA-plus-semantics parser learned from the same data. It is the interaction of these bottom-up and top-down strategies in dialogue understanding (where the former is taken to include the ambiguities derived from the acoustic model) that we seek to investigate. This can perfectly well be seen as part of the programme for a full dialogue model laid out in [Young, 2000], in which he envisaged a dialogue model as one whose different parts are separately observed and framed before being combined. We have shown that a simple dialogue act tagger can be created that uses just n-gram cues for classification. This naive tagger performs modestly, but still surprisingly well given its simplicity. More significantly, we have shown that a naive n-gram classifier can be used to pre-tag the input to transformation-based learning, which removes the need for a vast number of n-gram features in the learning algorithm. One of the prime motivations for using TBL was its resilience to such a high number of features, so by removing the need to incorporate them we are hopeful that we can use a range of machine learning approaches for this task. As regards the naive n-gram classifier, we have described how training the classifier involves pruning the n-gram set by applying thresholds for predictivity and absolute occurrence. These thresholds, which are empirically determined, are applied globally, and will have a greater impact in eliminating possible n-gram cues for the less frequently occurring dialogue act types. We aim to investigate the effect of using local thresholds for each dialogue act type, in an attempt to keep an adequate n-gram representation of all dialogue act types, including the less frequently occurring ones. Finally, we aim to apply these techniques to a new corpus collected for the AMITIES project, consisting of human-human conversations recorded in the call centre domain [Hardy et al., 2002]. We hope that the techniques outlined
here will prove a useful first step in creating automatic service counters for call centre applications. Although we have described a dialogue analysis approach in one project (AMITIES) and a DAM in another (COMIC), this is merely a side effect of funding strategy, and we expect to bring the two together in a single system, along with an appropriate generation component and ASR front end. With the generation of more data from our already functioning DAM, we hope to derive constraints on stack access and on the reopening of unpopped DAFs. This, if successful, will be an important demonstration of how DAFs function differently from ATNs in syntactic analysis (e.g. [Woods, 1970]), where non-determinism requires both backtracking and the exhaustion of all unpopped ATNs for completeness and the generation of all valid parses of a sentence. It should be noted that this is not at all the case here: there is no provision for backtracking in DAFs, and we expect to derive strong constraints such that not all unpopped DAFs will be reactivated. Analogously to the early dialogue findings of Grosz [1977] and Reichman [1985], we expect that some unpopped DAFs will not be reopenable after a substantial dialogue delay, just as they showed that dialogue segments and topics are closed off and eventually become inaccessible. Also, the ATN interpreter, unlike its use in syntactic processing, is deterministic, since in every state there will be a best match between some arc condition and the incoming representations. In this chapter we have discussed aspects of our approach to dialogue analysis/fusion and control, but have not touched at all on generation/fission or on the role of knowledge-rich items, such as belief and planning structures, in that phase. All this we leave to a later paper.
Acknowledgements This chapter is based on work supported in part by the European Commission under the 5th Framework IST/HLT Program (consortia AMITIES and COMIC), and by the U.S. Defense Advanced Research Projects Agency.
References

Allen, J. F. and Perrault, C. R. (1980). Analyzing Intentions in Utterances. Artificial Intelligence, 15(3):143–178.

Allen, J. F., Schubert, L. K., Ferguson, G., Heeman, P., Hwang, C. H., Kato, T., Light, M., Martin, N. G., Miller, B. W., Poesio, M., and Traum, D. R. (1995). The TRAINS Project: A Case Study in Building a Conversational Planning Agent. Journal of Experimental and Theoretical AI (JETAI), 7:7–48.

Ballim, A. and Wilks, Y. (1991). Artificial Believers. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Brill, E. (1995). Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics, pages 234–246.

Carbonell, J. G., Michalski, R. S., and Mitchell, T. M. (1983). An Overview of Machine Learning. In Carbonell, J. G., Michalski, R. S., and Mitchell, T. M., editors, Machine Learning: An Artificial Intelligence Approach, pages 168–185. Palo Alto, CA: Tioga Pub Co.

Colby, K. M. (1971). Artificial Paranoia. Artificial Intelligence, 2:76–89.

Fikes, R. E. and Nilsson, N. J. (1971). STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. In Proceedings of the Second International Joint Conference on Artificial Intelligence (IJCAI-71), volume 1, pages 111–119.

Grosz, B. (1977). The Representation and Use of Focus in Understanding Dialogs. In Grosz, B., Jones, K. S., and Webber, B. L., editors, Readings in Natural Language Processing, pages 56–67. Morgan Kaufmann Publishers Inc.

Hardy, H., Baker, K., Devillers, L., Lamel, L., Rosset, S., Strzalkowski, T., Ursu, C., and Webb, N. (2002). Multi-Layered Dialogue Annotation for Automated Multilingual Customer Service. In Proceedings of the ISLE Workshop on Dialogue Tagging for Multimodal Human Computer Interaction, pages 90–99, Edinburgh, UK.

Hearst, M. A. (1993). TextTiling: A Quantitative Approach to Discourse Segmentation. Technical Report UCB:S2K-93-24, Berkeley, CA.

Hobbs, J. R. (1993). The Generic Information Extraction System. In Proceedings of the Fifth Message Understanding Conference (MUC-5), pages 87–91. Morgan Kaufmann Publishers.

Jurafsky, D., Bates, R., Coccaro, N., Martin, R., Meteer, M., Ries, K., Shriberg, E., Stolcke, A., Taylor, P., and van Ess-Dykema, C. (1998). Switchboard Discourse Language Modeling Project Report. Research Note 30, Center for Speech and Language Processing, Johns Hopkins University, Baltimore, MD.

Jurafsky, D., Shriberg, E., and Biasca, D. (1997). Switchboard-DAMSL Labeling Project Coder's Manual. Technical Report 97-02, University of Colorado, Institute of Cognitive Science, Boulder, CO.

Lager, T. (1999). The µ-TBL System: Logic Programming Tools for Transformation-Based Learning. In Proceedings of the Third International Workshop on Computational Natural Language Learning, pages 190–201, Bergen, Norway.

Lager, T. and Zinovjeva, N. (1999). Training a Dialogue Act Tagger with the µ-TBL System. In Proceedings of the Third Swedish Symposium on Multimodal Communication, pages 66–87. Linköping University Natural Language Processing Laboratory.

Larsson, S. and Traum, D. (2000). Information State and Dialogue Management in the TRINDI Dialogue Move Engine Toolkit. Natural Language Engineering, 6:267–278.

Lemon, O., Bracy, A., Gruenstein, A. R., and Peters, S. (2001). The WITAS Multi-Modal Dialogue System I. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pages 1559–1562, Aalborg, Denmark.

Levy, D., Catizone, R., Battacharia, B., Krotov, A., and Wilks, Y. (1997). CONVERSE: A Conversational Companion. In Proceedings of the First International Workshop on Human-Computer Conversation, pages 27–34, Bellagio, Italy.

Loebner Competition (1990). http://www.loebner.net/Prizef/loebner-prize.html.

Newell, A. (1990). Unified Theories of Cognition. Cambridge, MA: Harvard University Press.

Reichman, R. (1985). Getting Computers to Talk Like You and Me. Cambridge, MA: MIT Press.

Reithinger, N. and Klesen, M. (1997). Dialogue Act Classification Using Language Models. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pages 2235–2238, Rhodes, Greece.

Samuel, K., Carberry, S., and Vijay-Shanker, K. (1998). Dialogue Act Tagging with Transformation-Based Learning. In Proceedings of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), volume 2, pages 1150–1156, Montreal.

Stolcke, A., Ries, K., Coccaro, N., Shriberg, E., Bates, R., Jurafsky, D., Taylor, P., Martin, R., van Ess-Dykema, C., and Meteer, M. (2000). Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech. Computational Linguistics, 26(3):339–373.

Walker, M. A. (2000). An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email. Journal of Artificial Intelligence Research, 12:387–416.

Woods, W. A. (1970). Transition Network Grammars for Natural Language Analysis. Communications of the ACM, 13(10):591–606.

Young, S. J. (2000). Probabilistic Methods in Spoken Dialogue Systems. Philosophical Transactions of the Royal Society (Series A), 358(1769):1389–1402.
Index
Activity Models, 295, 296, 301, 302 AdApt, 234 adaptive dialogue, see dialogue, adaptive animated agent, 34, 36, 184, 194, 216, 230, 233– 235, 237 talking head, 5, 15, 185, 187, 189, 206, 210, 216–221, 223, 224, 227, 231, 236, 245 annotation accuracy, 363, 365 graphs, 85–87, 93 of dialogue acts, 363, 365 schemes, 79, 91–93 tools, 87, 92, 93 architecture, 36, 60–62, 73, 266, 268, 287, 288, 292–294, 298, 301–303, 310, 311, 319, 320, 326, 327, 338, 360, 361 Architecture for Conversational Intelligence (ACI), 288, 294, 298, 300, 301 audio-visual input, 9, 160–162, 167, 168, 174–176 output, 9, 15, 184, 185, 187, 190–192, 206, 207, 210, 215, 216, 221, 236, 237 segmentation, see source segmentation Augmented Transition Network (ATN), 360, 362, 368 August, 33, 233, 234, 237, 238 autism, 194, 197, 200, 201, 205
clarification, 118, 119, 123 coding, see annotation collaboration, 56–58, 64, 66–68, 70–73 Collagen, 59–61, 73 computer games, 99 computer vision, see vision conversation, see dialogue conversational agent, see embodied conversational agent corpora, 80, 90, 91, 123, 125, 141, 356, 357, 363–367
backchannelling, 41, 66, 67, 114, 119, 120, 225 Baldi, 184–188, 193–196, 198, 201, 205–211 BEAT, 61, 94, 95 blind, 132 body movement, 32, 66–68, 71, 80, 93, 97, 100, 101, 103 breakdown, 114, 117–119, 122, 125, 361
education, 56, 184, 195, 197, 211, 216, 288 edutainment, 211 effectiveness, 124, 203 of information presentation, 148, 184, 201, 216, 219 of interaction, 114, 125, 126, 223, 224 of teaching, 194, 197, 204, 208, 209, 211, 235 embodied conversational agent, 24, 25, 27, 32– 34, 40, 42, 45, 46, 48, 49, 245–248, 250, 251, 260 emotions, 223, 237, 238 enabling technologies, 2, 8, 11, 12, 14, 17
damage control, 288, 289, 302 data resources, see corpora deaf, see hearing impaired dialogue acts, 337, 338, 357, 361–367 adaptive, 336, 337, 341, 342, 344, 347 management, 233, 292–296, 298, 300– 302, 312, 313, 318–320, 326, 337, 343, 357, 358 manager, see dialogue management modelling, 356, 357, 362, 367 social, 23–25, 29, 30, 32, 33, 36, 38, 40, 44–46, 48, 49 Dialogue Action Frames (DAFs), 356, 360–362, 366–368 Dialogue and Action Manager (DAM), 358– 362, 367, 368 Dialogue Move Tree, 298–302
camera input, 28, 65, 66, 93, 94, 159–162, 164– 168, 171, 174–178, 336, 338, 342, 346, 347 CHILDES, 92 CLAN, 92
engagement, 55–59, 62–64, 66–68, 70–74 entertainment, 56, 216 evaluation, 5, 15, 16, 48, 203, 205, 238, 248, 251, 252, 302, 303, 331, 335, 348 perceptual, 190, 192, 193, 217, 233 performance, 197, 198, 200–203, 205, 208, 238, 257 quantitative, 352 subjective, 33, 42, 43, 46, 257, 259 usability, 34, 135, 152 facial expression, 27, 28, 32, 184, 224, 225, 238, 239, 246, 247 feature-based structures, 275, 276, 278, 279 feedback, 26, 27, 34, 35, 41, 116, 195, 196, 201, 215, 216, 224–227, 229, 231, 235, 237, 292, 296, 297, 309–312, 319, 322 Festival, 294 FORM, 79, 80, 85–88, 90–95 fusion, see multimodal fusion future visions, 11 gas turbine control, 59–61 gaze, 25–27, 32, 33, 41, 57, 65, 66, 71, 72, 216, 223, 245, 247–252, 254, 255, 257– 260 Gemini, 293, 294 geometric technique, 160–162, 167, 168, 175 gesture, 26–28, 38, 41, 57, 58, 60, 62, 63, 65, 66, 94, 98–103, 105–108, 298, 312, 338, 339 analysis, 91, 92 annotation, 79, 80, 86, 88–90, 92, 93 facial output, 216, 219, 223–226, 235, 236, 238 pointing, 26, 59, 65, 71, 269, 276, 278, 281, 282 recognition, 277 grammar, 293, 294, 301, 334–336, 338, 340, 341, 343–345, 347, 351, 352 handheld devices, 171, 308, 309, 331 haptic interaction, 308–315, 318–321, 326, 327, 331 hearing impaired, 184, 197, 204, 206–208, 219– 221, 223, 231, 233, 236 history context, 323 dialogue, 266–268, 273, 279–282, 298, 299, 312, 320, 322, 323, 326, 343 modality, 323 hosting, 56, 57, 59, 61–65, 68, 70–74 human-human data, 362, 367 gaze, 247, 248, 250, 251, 259 hosting, 70
interaction, 23, 40, 47, 48, 63, 114, 126, 251, 337 information visualisation, 134, 142, 207, 308– 310, 312–319, 321, 331 intelligibility, 183, 187, 189, 191, 216, 217, 219–221, 232, 233 inter-annotator agreement, 80, 88–90 interpretation context-based, 269, 270, 277, 278, 337, 340, 341, 343 frame, 266 framework, 266, 282 of input, 59, 266, 267, 273, 277, 280, 282, 292, 294, 298, 299, 301, 313, 319, 321, 327 intonation, 26–28, 90 Karin, 246, 247, 251–255, 257, 259, 260 kinematic information, 79, 88, 90, 91 language acquisition, 92, 334–336, 341, 342, 344, 347, 352 learning, 184, 194, 205, 206, 211, 218, 235, 237 likeability, 42, 245 Likert scale, 42, 256 lip reading, see speech reading lips, 172, 176, 185–188, 191, 216, 217, 219, 220, 233 machine learning, 356, 363 medium, defined, 16 Mel, 59–63, 73, 74 messages multimodal, 145, 146, 153 oral, 133, 136, 138, 139, 141–143, 145– 154 visual, 133, 140, 142, 145 MIAMM, 308–311, 313, 314, 317, 319, 320, 323, 326, 327, 331 microphone array, 160–163, 165, 167, 175 MIND, 265–282 mobile devices, see handheld devices modalities investigated, 18 modality defined, 16, 17 theory, see theory, modality multi-speaker speech recognition, 4, 9, 11–13 multilinguality, 308, 331 multimodal fusion, 108, 160, 161, 168, 170, 173, 267, 276, 278, 282, 311, 312, 320, 322, 324 Multimodal Interface Language (MMIL), 326 multimodality defined, 16, 17 state-of-the-art, 131, 132
tagging, see annotation talking face, see animated talking head talking head, see animated talking head Teleface, 232 text-to-speech, see speech, synthesis theory, 7, 8 activity, 302 AI planning, 360 Chomsky, 341 collaboration, 71–73 conversation, 273 engagement, 73 information, 159, 161, 162, 167, 168, 170, 335 modality, 137 tongue, 185–187, 205, 206, 210, 216–218, 220– 222 trust, 24, 29, 30, 33, 35–37, 40–42, 45, 46, 48 turn-taking, 26, 32, 35, 38, 216, 224, 232, 234, 235, 237, 247, 249, 250, 257, 259, 260 tutoring and tutoring applications, see tutors tutors, 56, 57, 59–62, 66, 183, 184, 194–198, 201–203, 205, 208, 210, 235, 287– 303 usability, 131–133, 135, 152, 153 user modelling, 266, 272, 282, 319, 323 preferences, see user modelling profile, see user modelling satisfaction, 16, 152, 153, 255, 257–259 virtual agent, 27, 40, 41, 246, 247, 252 environment, 27, 40, 41, 97, 98, 106, 108, 246, 247, 252 visible speech synthesis, see speech synthesis vision, 32, 63, 73, 160–166 tracking, 160, 162–166, 172, 174, 175, 177 VISLab, 93, 94 visual prosody, 216, 224, 225, 227 search, 133, 134, 136, 140, 141, 143–145, 148–150, 152–154 visualisation, see information visualisation vocabulary learning, 184, 194, 195, 197, 198, 200–203, 205, 210 WaveSurfer, 218, 219, 222, 237 Waxholm, 231, 232 WITAS, 288, 303, 360, 361 Wizard-of-Oz, 41, 97, 99, 105, 107, 108 XML, 93, 219, 235, 326
Text, Speech and Language Technology

1. H. Bunt and M. Tomita (eds.): Recent Advances in Parsing Technology. 1996 ISBN 0-7923-4152-X
2. S. Young and G. Bloothooft (eds.): Corpus-Based Methods in Language and Speech Processing. 1997 ISBN 0-7923-4463-4
3. T. Dutoit: An Introduction to Text-to-Speech Synthesis. 1997 ISBN 0-7923-4498-7
4. L. Lebart, A. Salem and L. Berry: Exploring Textual Data. 1998 ISBN 0-7923-4840-0
5. J. Carson-Berndsen: Time Map Phonology. 1998 ISBN 0-7923-4883-4
6. P. Saint-Dizier (ed.): Predicative Forms in Natural Language and in Lexical Knowledge Bases. 1999 ISBN 0-7923-5499-0
7. T. Strzalkowski (ed.): Natural Language Information Retrieval. 1999 ISBN 0-7923-5685-3
8. J. Harrington and S. Cassiday: Techniques in Speech Acoustics. 1999 ISBN 0-7923-5731-0
9. H. van Halteren (ed.): Syntactic Wordclass Tagging. 1999 ISBN 0-7923-5896-1
10. E. Viegas (ed.): Breadth and Depth of Semantic Lexicons. 1999 ISBN 0-7923-6039-7
11. S. Armstrong, K. Church, P. Isabelle, S. Nanzi, E. Tzoukermann and D. Yarowsky (eds.): Natural Language Processing Using Very Large Corpora. 1999 ISBN 0-7923-6055-9
12. F. Van Eynde and D. Gibbon (eds.): Lexicon Development for Speech and Language Processing. 2000 ISBN 0-7923-6368-X; Pb: 0-7923-6369-8
13. J. Véronis (ed.): Parallel Text Processing. Alignment and Use of Translation Corpora. 2000 ISBN 0-7923-6546-1
14. M. Horne (ed.): Prosody: Theory and Experiment. Studies Presented to Gösta Bruce. 2000 ISBN 0-7923-6579-8
15. A. Botinis (ed.): Intonation. Analysis, Modelling and Technology. 2000 ISBN 0-7923-6605-0
16. H. Bunt and A. Nijholt (eds.): Advances in Probabilistic and Other Parsing Technologies. 2000 ISBN 0-7923-6616-6
17. J.-C. Junqua and G. van Noord (eds.): Robustness in Languages and Speech Technology. 2001 ISBN 0-7923-6790-1
18. R.H. Baayen: Word Frequency Distributions. 2001 ISBN 0-7923-7017-1
19. B. Granström, D. House and I. Karlsson (eds.): Multimodality in Language and Speech Systems. 2002 ISBN 1-4020-0635-7
20. M. Carl and A. Way (eds.): Recent Advances in Example-Based Machine Translation. 2003 ISBN 1-4020-1400-7; Pb 1-4020-1401-5
21. A. Abeillé: Treebanks. Building and Using Parsed Corpora. 2003 ISBN 1-4020-1334-5; Pb 1-4020-1335-3
22. J. van Kuppevelt and R.W. Smith (eds.): Current and New Directions in Discourse and Dialogue. 2003 ISBN 1-4020-1614-X; Pb 1-4020-1615-8
23. H. Bunt, J. Carroll and G. Satta (eds.): New Developments in Parsing Technology. 2004 ISBN 1-4020-2293-X; Pb 1-4020-2294-8
24. G. Fant: Speech Acoustics and Phonetics. Selected Writings. 2004 ISBN 1-4020-2373-1; Pb 1-4020-2789-3
25. W.J. Barry and W.A. Van Dommelen (eds.): The Integration of Phonetic Knowledge in Speech Technology. 2005 ISBN 1-4020-2635-8; Pb 1-4020-2636-6
26. D. Dahl (ed.): Practical Spoken Dialog Systems. 2004 ISBN 1-4020-2674-9; Pb 1-4020-2675-7
27. O. Stock and M. Zancanaro (eds.): Multimodal Intelligent Information Presentation. 2005 ISBN 1-4020-3049-5; Pb 1-4020-3050-9
28. W. Minker, D. Bühler and L. Dybkjaer (eds.): Spoken Multimodal Human-Computer Dialogue in Mobile Environments. 2004 ISBN 1-4020-3073-8; Pb 1-4020-3074-6
29. P. Saint-Dizier (ed.): Syntax and Semantics of Prepositions. 2005 ISBN 1-4020-3849-6
30. J.C.J. van Kuppevelt, L. Dybkjaer and N.O. Bernsen (eds.): Advances in Natural Multimodal Dialogue Systems. 2005 ISBN 1-4020-3932-8
31. P. Grzybek (ed.): Contributions to the Science of Text and Language. Word Length Studies and Related Issues. 2006 ISBN 1-4020-4067-9